[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765187#comment-13765187 ] Bikas Saha commented on YARN-540: - I think there is a member variable in AMRMClient that is used to get the ping interval from config. We could use that instead of the hardcoded 100. Sorry for not mentioning it earlier.
{code}
+LOG.info("Waiting for application to be successfully unregistered.");
+Thread.sleep(100);
{code}
Can we rename isAppRemovedFromStateStore() to isAppSafeToUnregister()? Then we can move the check for unmanagedAM within that method. This way we won't leak unmanagedAM outside RMAppImpl. This transition is invalid and should not be ignored. It's a bug if it happens.
{code}
+// ignorable transitions
+.addTransition(RMAppState.REMOVING, RMAppState.REMOVING,
+RMAppEventType.ATTEMPT_UNREGISTERED)
{code}
Shouldn't the app.isAppRemovalRequestSent flag be checked here, since this will typically happen after unregister has already removed the app? How is this working on a single node cluster? Is delete not throwing an exception for a non-existent location?
{code}
+ // application completely done and remove from state store.
+ // App state may be already removed during RMAppFinishingOrRemovingTransition.
+ RMStateStore store = app.rmContext.getStateStore();
+ store.removeApplication(app);
{code}
What is the YarnApplicationState enum corresponding to AppState.REMOVING? Is MockRMApp never expected to get removed from the store? I would have expected this to return true.
{code}
+ @Override
+ public boolean isAppRemovedFromStateStore() {
+return false;
{code}
Can RMAppEventType.ATTEMPT_FAILED be received when in the REMOVING state (and also when in the FINISHING state)?
Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.patch, YARN-540.patch
When a job succeeds and successfully calls finishApplicationMaster, the RM shuts down and restarts; the dispatcher is stopped before it can process the REMOVE_APP event. The next time the RM comes back, it will reload the existing state files even though the job has succeeded.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
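A minimal sketch of the rename Bikas suggests, with the unmanaged-AM check folded inside RMAppImpl; the field names are assumptions since the actual patch is not quoted in this thread:
{code}
// Sketch only: illustrates isAppSafeToUnregister() encapsulating the
// unmanaged-AM check inside RMAppImpl. Field names are assumptions.
public class RMAppImplSketch {
  private volatile boolean unmanagedAM;           // assumed field
  private volatile boolean removedFromStateStore; // assumed field

  /**
   * Callers (e.g. the unregister path) ask one question instead of
   * peeking at unmanagedAM from outside the class. Unmanaged AMs are
   * never relaunched by the RM, so they need not wait for the
   * state-store removal to complete.
   */
  public boolean isAppSafeToUnregister() {
    return unmanagedAM || removedFromStateStore;
  }
}
{code}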
[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
[ https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765329#comment-13765329 ] Hudson commented on YARN-1176: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #330 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/330/]) YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522062) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes -- Key: YARN-1176 URL: https://issues.apache.org/jira/browse/YARN-1176 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta Reporter: Thomas Graves Assignee: Jonathan Eagles Priority: Critical Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta Attachments: YARN-1176.patch In the web services api for the cluster/metrics, the totalNodes reported doesn't include the unhealthy nodes. this.totalNodes = activeNodes + lostNodes + decommissionedNodes + rebootedNodes; -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
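The description pinpoints the bug: unhealthy nodes are missing from the sum. A sketch of what the corrected aggregation presumably amounts to (only the variable names quoted above come from the source; the method wrapper and the unhealthyNodes name are assumptions):
{code}
// Sketch of the corrected total in ClusterMetricsInfo: unhealthy nodes are
// counted alongside the other node states.
static int totalNodes(int activeNodes, int lostNodes, int decommissionedNodes,
    int rebootedNodes, int unhealthyNodes) {
  return activeNodes + lostNodes + decommissionedNodes + rebootedNodes
      + unhealthyNodes;
}
{code}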
[jira] [Created] (YARN-1184) ClassCastException is thrown during preemption when a huge job is submitted to a queue B whose resources are used by a job in queue A
J.Andreina created YARN-1184: Summary: ClassCastException is thrown during preemption when a huge job is submitted to a queue B whose resources are used by a job in queue A Key: YARN-1184 URL: https://issues.apache.org/jira/browse/YARN-1184 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: J.Andreina
Preemption is enabled. Queues = a, b; a capacity = 30%, b capacity = 70%.
Step 1: Assign a big job to queue a (so that job_a will utilize some resources from queue b).
Step 2: Assign a big job to queue b. The following exception is thrown at the Resource Manager:
{noformat}
2013-09-12 10:42:32,535 ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception.
java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
	at java.lang.Thread.run(Thread.java:662)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
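The failing cast is easy to reproduce in isolation: the wrapper returned by Collections.unmodifiableSet implements only Set, so down-casting it to NavigableSet fails at runtime. The standalone snippet below illustrates that failure mode and one defensive alternative; it is not the ProportionalCapacityPreemptionPolicy fix itself:
{code}
import java.util.Collections;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class UnmodifiableSetCastDemo {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    NavigableSet<Integer> live = new TreeSet<Integer>();
    live.add(1);

    // Collections.unmodifiableSet returns Collections$UnmodifiableSet, which
    // does not implement NavigableSet -- the cast below throws the same
    // ClassCastException seen in the stack trace above.
    Set<Integer> view = Collections.unmodifiableSet(live);
    try {
      NavigableSet<Integer> bad = (NavigableSet<Integer>) view;
      bad.first();
    } catch (ClassCastException expected) {
      System.out.println("cast failed as expected: " + expected);
    }

    // One defensive option: copy into a TreeSet instead of down-casting.
    NavigableSet<Integer> copy = new TreeSet<Integer>(view);
    System.out.println("first = " + copy.first());
  }
}
{code}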
[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
[ https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765352#comment-13765352 ] Hudson commented on YARN-1176: -- FAILURE: Integrated in Hadoop-Hdfs-0.23-Build #728 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/728/]) YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522069) * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes -- Key: YARN-1176 URL: https://issues.apache.org/jira/browse/YARN-1176 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta Reporter: Thomas Graves Assignee: Jonathan Eagles Priority: Critical Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta Attachments: YARN-1176.patch In the web services api for the cluster/metrics, the totalNodes reported doesn't include the unhealthy nodes. this.totalNodes = activeNodes + lostNodes + decommissionedNodes + rebootedNodes; -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
[ https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765417#comment-13765417 ] Hudson commented on YARN-1176: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1520 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1520/]) YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522062) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes -- Key: YARN-1176 URL: https://issues.apache.org/jira/browse/YARN-1176 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta Reporter: Thomas Graves Assignee: Jonathan Eagles Priority: Critical Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta Attachments: YARN-1176.patch In the web services api for the cluster/metrics, the totalNodes reported doesn't include the unhealthy nodes. this.totalNodes = activeNodes + lostNodes + decommissionedNodes + rebootedNodes; -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
[ https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765444#comment-13765444 ] Hudson commented on YARN-1176: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1546 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1546/]) YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522062) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes -- Key: YARN-1176 URL: https://issues.apache.org/jira/browse/YARN-1176 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta Reporter: Thomas Graves Assignee: Jonathan Eagles Priority: Critical Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta Attachments: YARN-1176.patch In the web services api for the cluster/metrics, the totalNodes reported doesn't include the unhealthy nodes. this.totalNodes = activeNodes + lostNodes + decommissionedNodes + rebootedNodes; -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765538#comment-13765538 ] Karthik Kambatla commented on YARN-1183: Patch looks mostly good. A few comments:
# The ConcurrentHashMap should be initialized with a smaller value for concurrencyLevel (2?), and maybe initialCapacity too.
# For the keys in the map, we should probably use ApplicationId instead of trackingUrl, as the trackingUrl can change by the time the app completes.
# Rename waitAppMastersToFinish to waitForAppMastersToFinish?
MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183.patch
As described in MAPREDUCE-5501, sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
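A sketch of how the first two review points could look together: key the map by ApplicationId rather than trackingUrl and size the ConcurrentHashMap conservatively. The class, field name, and value type below are assumptions, not the YARN-1183 patch:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;

/** Sketch only: field name and value type are assumptions, not the patch. */
class AppMasterTracker {
  // Keyed by ApplicationId (stable) rather than trackingUrl (which can change
  // by the time the app completes). Sized small: a MiniYARNCluster only ever
  // tracks a handful of app masters and only a couple of threads touch it.
  private final ConcurrentMap<ApplicationId, Boolean> finished =
      new ConcurrentHashMap<ApplicationId, Boolean>(
          16,    // initialCapacity
          0.75f, // default load factor
          2);    // concurrencyLevel: few concurrent writers expected
}
{code}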
[jira] [Updated] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows
[ https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-1078: Target Version/s: 3.0.0, 2.1.1-beta Affects Version/s: (was: 2.3.0) 2.1.1-beta TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows - Key: YARN-1078 URL: https://issues.apache.org/jira/browse/YARN-1078 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.1-beta Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor Attachments: YARN-1078.2.patch, YARN-1078.3.patch, YARN-1078.branch-2.patch, YARN-1078.patch The three unit tests fail on Windows due to host name resolution differences on Windows, i.e. 127.0.0.1 does not resolve to host name localhost. {noformat} org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container container_0__01_00 identifier is not valid for current Node manager. Expected : 127.0.0.1:12345 Found : localhost:12345 {noformat} {noformat} testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater) Time elapsed: 8343 sec FAILURE! org.junit.ComparisonFailure: expected:[localhost]:12345 but was:[127.0.0.1]:12345 at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at $Proxy26.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765649#comment-13765649 ] Bikas Saha commented on YARN-1185: -- Yes, since the FileSystem interface does not provide any atomic operations. The RM will not start if there is anything wrong with the stored state, so if some write is partial/empty it will not start. At that point we can judge whether the missing piece is important or not, and purge that piece and continue. This should be ok for job related data since we only lose a job. However, for global data like secret keys we may have to be more careful. In one case we encode the info in the file name. In other cases, where the data cannot be encoded in the file name, we may have to ensure that the store operation is not partial/empty. For HDFS we may assume atomic rename, but will that be true for all filesystems? So we could do the following. Storing app data may continue to be optimistic, and since that's the main workload we continue to do what we do today. Storing global data (mainly the security stuff) can change to be more atomic. We can make all store operations more atomic if we feel that we will not slow down the RM because of multiple roundtrips to the store.
FileSystemRMStateStore doesn't use temporary files when writing data Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe
FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
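The pattern the JIRA asks for, and that this comment weighs against extra round-trips to the store, is write-to-a-temp-file-then-rename. A minimal sketch against the generic Hadoop FileSystem API (the class and method are illustrative, not FileSystemRMStateStore code, and it assumes the target path does not already exist):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: write to a temporary file, then rename over the target. */
class AtomicWriteSketch {
  static void writeFileAtomically(FileSystem fs, Path target, byte[] data)
      throws IOException {
    Path tmp = new Path(target.getParent(), target.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true); // overwrite any stale temp file
    try {
      out.write(data); // single buffered write, as the store already does
    } finally {
      out.close();
    }
    // On HDFS rename is atomic, so a recovering RM sees either no file or a
    // complete one; as noted above, other FileSystems may not guarantee this.
    // Assumes the target does not already exist (rename would fail otherwise).
    if (!fs.rename(tmp, target)) {
      throw new IOException("Failed to rename " + tmp + " to " + target);
    }
  }
}
{code}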
[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1021: -- Attachment: YARN-1021.patch Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workload. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time and cost consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with reasonable amount of confidence, there-by aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AMs heartbeat events from within the same JVM. To keep tracking of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real time metrics while executing, including: * Resource usages for whole cluster and each queue, which can be utilized to configure cluster and queue's capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs turn around time, throughput, fairness, capacity guarantee, etc). * Several key metrics of scheduler algorithm, such as time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find the code spots and scalability limits. The simulator will provide real time charts showing the behavior of the scheduler and its performance. A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use simulator to simulate Fair Scheduler and Capacity Scheduler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1021: -- Attachment: (was: YARN-1021.pdf) Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workload. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time and cost consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with reasonable amount of confidence, there-by aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AMs heartbeat events from within the same JVM. To keep tracking of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real time metrics while executing, including: * Resource usages for whole cluster and each queue, which can be utilized to configure cluster and queue's capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs turn around time, throughput, fairness, capacity guarantee, etc). * Several key metrics of scheduler algorithm, such as time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find the code spots and scalability limits. The simulator will provide real time charts showing the behavior of the scheduler and its performance. A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use simulator to simulate Fair Scheduler and Capacity Scheduler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1149) NM throws InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING
[ https://issues.apache.org/jira/browse/YARN-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765722#comment-13765722 ] Xuan Gong commented on YARN-1149: - bq. It creates additional overhead, as the events will be sent every 1s. As the new requests will be blocked, what we need to prevent is that while stopService() is running, startContainers() shouldn't be halfway through, where the application is created but not added into the context. Then, if unfortunately CMgrCompletedAppsEvent is sent at this time, the newly created application will not be notified. Maybe we can use synchronization to prevent stopService() and startContainers() from being interleaved.
Yes, it does create some additional overhead. Maybe we can make the waiting time a little bit longer. As for using synchronization to prevent stopService() and startContainers() from being interleaved, I am thinking that we may have a race condition problem. I think that the containermanager should start serviceStop() immediately, without waiting. Imagine that we have several startContainers() calls and one serviceStop() call; we cannot guarantee when serviceStop() will be executed. It is very possible that it needs to wait for all the startContainers() requests to finish.
NM throws InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING - Key: YARN-1149 URL: https://issues.apache.org/jira/browse/YARN-1149 Project: Hadoop YARN Issue Type: Bug Reporter: Ramya Sunil Assignee: Xuan Gong Fix For: 2.1.1-beta Attachments: YARN-1149.1.patch, YARN-1149.2.patch, YARN-1149.3.patch, YARN-1149.4.patch
When the nodemanager receives a kill signal while an application has finished execution but log aggregation has not kicked in, InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING is thrown
{noformat}
2013-08-25 20:45:00,875 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just finished : application_1377459190746_0118
2013-08-25 20:45:00,876 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate log-file for app application_1377459190746_0118 at /app-logs/foo/logs/application_1377459190746_0118/host_45454.tmp
2013-08-25 20:45:00,876 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation to complete for application_1377459190746_0118
2013-08-25 20:45:00,891 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for container container_1377459190746_0118_01_04. 
Current good log dirs are /tmp/yarn/local 2013-08-25 20:45:00,915 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate log-file for app application_1377459190746_0118 2013-08-25 20:45:00,925 WARN application.Application (ApplicationImpl.java:handle(427)) - Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425) at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81) at java.lang.Thread.run(Thread.java:662) 2013-08-25 20:45:00,926 INFO application.Application (ApplicationImpl.java:handle(430)) - Application application_1377459190746_0118 transitioned from RUNNING to null 2013-08-25 20:45:00,927 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) -
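One way to realize the "don't let serviceStop() and startContainers() interleave" idea discussed above is a read/write lock: container starts take the read lock, shutdown takes the write lock. This is only a sketch of the synchronization pattern under discussion, not the ContainerManagerImpl code; the method bodies and field names are assumptions.
{code}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of guarding container starts against a concurrent service stop. */
class ContainerManagerSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile boolean stopped = false;

  void startContainer(String containerId) {
    lock.readLock().lock(); // many starts may proceed concurrently
    try {
      if (stopped) {
        throw new IllegalStateException("NM is shutting down");
      }
      // ... create the application and add it to the context atomically with
      // respect to serviceStop(), so a CMgrCompletedAppsEvent cannot miss a
      // half-registered application ...
    } finally {
      lock.readLock().unlock();
    }
  }

  void serviceStop() {
    lock.writeLock().lock(); // waits for in-flight startContainer() calls
    try {
      stopped = true;
      // ... notify all registered applications to finish ...
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}
Note that with this shape serviceStop() does wait for any in-flight startContainers() calls to drain, which is exactly the trade-off Xuan raises above.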
[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'
[ https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765709#comment-13765709 ] Akira AJISAKA commented on YARN-1166: - [~vinodkv], thanks for your comment. I understand, but I don't know how to find the 'final' attempt in order to increment the 'appsFailed' counter. Could you tell me how to find it? I attached a patch that decrements the counter by incrementing by -1; this is a quick workaround.
YARN 'appsFailed' metric should be of type 'counter' Key: YARN-1166 URL: https://issues.apache.org/jira/browse/YARN-1166 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Akira AJISAKA Priority: Blocker Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch
Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of type 'gauge' - which means the exact value will be reported. All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) are of type 'counter' - meaning Ganglia will use slope to provide deltas between time-points. To be consistent, the AppsFailed metric should also be of type 'counter'.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1021: -- Attachment: YARN-1021.pdf Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workload. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time and cost consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with reasonable amount of confidence, there-by aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AMs heartbeat events from within the same JVM. To keep tracking of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real time metrics while executing, including: * Resource usages for whole cluster and each queue, which can be utilized to configure cluster and queue's capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs turn around time, throughput, fairness, capacity guarantee, etc). * Several key metrics of scheduler algorithm, such as time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find the code spots and scalability limits. The simulator will provide real time charts showing the behavior of the scheduler and its performance. A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use simulator to simulate Fair Scheduler and Capacity Scheduler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows
[ https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765562#comment-13765562 ] Hudson commented on YARN-1078: -- SUCCESS: Integrated in Hadoop-trunk-Commit #4404 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4404/]) YARN-1078. TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows. Contributed by Chuan Liu. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1522644) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows - Key: YARN-1078 URL: https://issues.apache.org/jira/browse/YARN-1078 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.3.0 Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor Attachments: YARN-1078.2.patch, YARN-1078.3.patch, YARN-1078.branch-2.patch, YARN-1078.patch The three unit tests fail on Windows due to host name resolution differences on Windows, i.e. 127.0.0.1 does not resolve to host name localhost. {noformat} org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container container_0__01_00 identifier is not valid for current Node manager. Expected : 127.0.0.1:12345 Found : localhost:12345 {noformat} {noformat} testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater) Time elapsed: 8343 sec FAILURE! org.junit.ComparisonFailure: expected:[localhost]:12345 but was:[127.0.0.1]:12345 at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at $Proxy26.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985) {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-867: --- Attachment: YARN-867.4.patch Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results any non-IOException will cause the NM's async dispatcher to exit as the service's INIT APP event is not handled properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696 ] Xuan Gong commented on YARN-867: The new patch adds more transitions in ContainerState.EXITED_WITH_FAILURE and ContainerState.DONE. This patch still handles AuxServicesEventType.APPLICATION_INIT and handles exceptions at the container level. I thought about moving AuxServicesEventType.APPLICATION_INIT into the application, but I do not think that we will get any benefits. The reasons are:
1. There are two new events: AuxServicesEvent.CONTAINER_INIT and AuxServicesEvent.CONTAINER_STOP. We need to handle them at the container level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application, we have two options:
a. We do not start any containers until all the AuxServices finish their APPLICATION_INIT. If we choose this, that definitely simplifies the problem: when there is any exception from APPLICATION_INIT on the AuxServices, simply kill the application. But does it make sense that we need to block all the containers?
b. We can let the AuxServices do APPLICATION_INIT while containers start at the same time. If this is the case, we will go through the same process as now, because when the container receives the CONTAINER_EXITED_WITH_FAILURE event, we cannot guarantee which state the container is in: maybe the killing state, the LOCALIZED state, etc. Any state is possible.
Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch
Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
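Independent of where APPLICATION_INIT is handled, the isolation asked for in this JIRA boils down to catching whatever an auxiliary service throws and turning it into a failure for that application or container only, rather than letting it escape into the NM's shared async dispatcher. A toy sketch of that pattern, with all types hypothetical:
{code}
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Toy illustration of the isolation pattern under discussion: an exception
 * from one auxiliary service's APPLICATION_INIT handling is caught and turned
 * into a failure notification for that application only, instead of crashing
 * the shared dispatcher thread. All names here are hypothetical.
 */
class AuxServiceIsolationSketch {
  private static final Logger LOG =
      Logger.getLogger(AuxServiceIsolationSketch.class.getName());

  interface AuxService { void initApplication(String appId, byte[] serviceData); }
  interface FailureSink { void applicationInitFailed(String appId, Throwable cause); }

  void dispatchApplicationInit(AuxService service, String appId,
                               byte[] serviceData, FailureSink sink) {
    try {
      service.initApplication(appId, serviceData);
    } catch (Throwable t) {
      // A malicious or buggy payload must not bring down the NodeManager:
      // report the failure for this application and keep the dispatcher alive.
      LOG.log(Level.WARNING, "Aux service failed to init app " + appId, t);
      sink.applicationInitFailed(appId, t);
    }
  }
}
{code}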
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765760#comment-13765760 ] Luke Lu commented on YARN-311: -- I think the v6.2 patch looks reasonable. If there are no further objections, I plan to commit it over the weekend.
Dynamic node resource configuration: core scheduler changes --- Key: YARN-311 URL: https://issues.apache.org/jira/browse/YARN-311 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, YARN-311-v6.2.patch, YARN-311-v6.patch
As the first step, we go for resource change on the RM side and expose admin APIs (admin protocol, CLI, REST and JMX API) later. This jira will only contain changes in the scheduler. The flow to update a node's resource and make resource scheduling aware of it is:
1. The resource update comes through the admin API to the RM and takes effect on RMNodeImpl.
2. When the next NM heartbeat for updating status comes, the RMNode's resource change will be noticed and the delta resource is added to the SchedulerNode's availableResource before actual scheduling happens.
3. The scheduler does resource allocation according to the new availableResource in the SchedulerNode.
For more design details, please refer to the proposal and discussions in the parent JIRA: YARN-291.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
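Steps 1-3 above amount to computing the delta between the newly configured node total and the old total, then folding it into the scheduler node's headroom before the next allocation pass. A rough sketch of that arithmetic using the Resources helper class (the wrapper class and field names are assumptions, not the YARN-311 patch):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/** Sketch of step 2: fold the admin-updated total into the node's headroom. */
class NodeResourceUpdateSketch {
  private Resource totalCapability;   // assumed field: node's configured total
  private Resource availableResource; // assumed field: schedulable headroom

  void onHeartbeatWithNewTotal(Resource newTotal) {
    // Delta between the newly configured total and the old total...
    Resource delta = Resources.subtract(newTotal, totalCapability);
    // ...is applied to the available resource before scheduling happens,
    // so allocations already in flight are not disturbed.
    Resources.addTo(availableResource, delta);
    totalCapability = newTotal;
  }
}
{code}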
[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'
[ https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765764#comment-13765764 ] Akira AJISAKA commented on YARN-1166: - [~srimanth.gunturi], this fix seems simple. I've already uploaded 'YARN-1166.2.patch' for this. However, if the first ApplicationAttempt fails and the second ApplicationAttempt succeeds, the counter is incremented with this patch. Do you think that is the correct behavior?
YARN 'appsFailed' metric should be of type 'counter' Key: YARN-1166 URL: https://issues.apache.org/jira/browse/YARN-1166 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Akira AJISAKA Priority: Blocker Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch
Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of type 'gauge' - which means the exact value will be reported. All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) are of type 'counter' - meaning Ganglia will use slope to provide deltas between time-points. To be consistent, the AppsFailed metric should also be of type 'counter'.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765622#comment-13765622 ] Jason Lowe commented on YARN-1185: -- Also, couldn't it be left with zero-length files if it dies after create but before write can occur? FileSystemRMStateStore doesn't use temporary files when writing data Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765620#comment-13765620 ] Jason Lowe commented on YARN-1185: -- Ah I see. That's an assumption based on a specific FileSystem implementation -- does it only work with HDFS? FileSystemRMStateStore doesn't use temporary files when writing data Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'
[ https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765716#comment-13765716 ] Srimanth Gunturi commented on YARN-1166: Isn't the fix as simple as changing the {{appsFailed}} type to {{MutableCounterInt}}, and removing the below?
{code}
else {
  appsFailed.decr();
}
{code}
This would make appsFailed cumulative, as in {{finishApp(AppSchedulingInfo, RMAppAttemptState)}} we do the {{appsFailed.incr()}} on the final attempt state?
YARN 'appsFailed' metric should be of type 'counter' Key: YARN-1166 URL: https://issues.apache.org/jira/browse/YARN-1166 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Akira AJISAKA Priority: Blocker Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch
Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of type 'gauge' - which means the exact value will be reported. All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) are of type 'counter' - meaning Ganglia will use slope to provide deltas between time-points. To be consistent, the AppsFailed metric should also be of type 'counter'.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
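For reference, the two Hadoop metrics2 field types being contrasted look like this in a QueueMetrics-style class; the sketch below illustrates the proposed change and is not the actual patch:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.lib.MutableCounterInt;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

/** Sketch of the type change being proposed; field names mirror QueueMetrics. */
class QueueMetricsSketch {
  // Before: a gauge reports the current value and can be decremented, which
  // is why the existing code needs appsFailed.decr() on a later success.
  @Metric("# of apps failed (gauge)") MutableGaugeInt appsFailedGauge;

  // After: a counter only ever increases, so Ganglia can derive deltas from
  // its slope like the other cumulative queue metrics. It would be bumped
  // once, when the final attempt fails.
  @Metric("# of apps failed") MutableCounterInt appsFailed;
}
{code}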
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765772#comment-13765772 ] Wei Yan commented on YARN-1021: --- Uploaded a new patch according to [~curino]'s comments. Main updates: (1) Let the simulator adopt Rumen traces directly, using the Rumen reader (JobStory); the document is also updated. (2) Added a testcase that starts the simulator using an example trace, runs for 2 minutes, and stops. (3) Fixed several minor issues (LICENSE/NOTICE, javadoc, string appends, etc.). For simulating slowstart, headroom and some other behaviors, opened JIRA YARN-1186. For discrete event-based simulation, opened JIRA YARN-1187.
Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workload. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time and cost consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with reasonable amount of confidence, there-by aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AMs heartbeat events from within the same JVM. To keep tracking of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. The simulator will produce real time metrics while executing, including: * Resource usages for whole cluster and each queue, which can be utilized to configure cluster and queue's capacity. * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs turn around time, throughput, fairness, capacity guarantee, etc). * Several key metrics of scheduler algorithm, such as time cost of each scheduler operation (allocate, handle, etc), which can be utilized by Hadoop developers to find the code spots and scalability limits. The simulator will provide real time charts showing the behavior of the scheduler and its performance. A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use simulator to simulate Fair Scheduler and Capacity Scheduler. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765735#comment-13765735 ] Andrey Klochkov commented on YARN-1183: --- Yep, you're right about trackingUrl being able to change. Please disregard this part. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends it's last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data
Jason Lowe created YARN-1185: Summary: FileSystemRMStateStore doesn't use temporary files when writing data Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1130) Improve the log flushing for tasks when mapred.userlog.limit.kb is set
[ https://issues.apache.org/jira/browse/YARN-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765870#comment-13765870 ] Paul Han commented on YARN-1130: Patch is submitted. A few notes here:
# Added a LogUtils class to facilitate triggering a log flush, since Log4J's interfaces such as LogManager don't provide a flush() method.
# Modified the Task class to trigger the flushing of logs before it sends the TASK_DONE event to the MR AppMaster. In some cases, TASK_DONE may trigger the container to be killed before all logs are written to disk.
# ContainerLogAppender supports flushing the log when a special message is received. The log flush is done in a synchronous manner with a timeout. This ensures the invoker of the flush waits until logs are written to disk or the timeout happens.
Improve the log flushing for tasks when mapred.userlog.limit.kb is set -- Key: YARN-1130 URL: https://issues.apache.org/jira/browse/YARN-1130 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Paul Han Fix For: 2.0.5-alpha Attachments: YARN-1130.patch
When the userlog limit is set with something like this:
{code}
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>2048</value>
  <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description>
</property>
{code}
log entries will be truncated randomly for the jobs. The log size is left between 1.2MB and 1.6MB. Since the log is already limited, avoiding log truncation is crucial for the user. The other issue with the current impl (org.apache.hadoop.yarn.ContainerLogAppender) is that log entries will not be flushed to file until the container shuts down and the log manager closes all appenders. If the user wants to see the log during task execution, this isn't supported. Will propose a patch to add a flush mechanism and also flush the log when the task is done.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
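Log4j 1.2 appenders expose no public flush() method, which is why a helper is needed at all. One common workaround, and a plausible (though unconfirmed) shape for such a LogUtils helper, is to flip WriterAppender's immediateFlush flag and emit a single record so buffered output reaches disk:
{code}
import java.util.Enumeration;
import org.apache.log4j.Appender;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.log4j.WriterAppender;

/**
 * Sketch of a LogUtils-style flush helper (not the actual YARN-1130 patch).
 * WriterAppender flushes after every event while immediateFlush is true, so
 * logging one record with that flag set pushes buffered output to disk.
 */
class LogFlushSketch {
  static void flushLogs() {
    Enumeration<?> loggers = LogManager.getCurrentLoggers();
    while (loggers.hasMoreElements()) {
      flushAppenders((Logger) loggers.nextElement());
    }
    flushAppenders(LogManager.getRootLogger());
  }

  private static void flushAppenders(Logger logger) {
    Enumeration<?> appenders = logger.getAllAppenders();
    while (appenders.hasMoreElements()) {
      Appender appender = (Appender) appenders.nextElement();
      if (appender instanceof WriterAppender) {
        WriterAppender wa = (WriterAppender) appender;
        boolean previous = wa.getImmediateFlush();
        wa.setImmediateFlush(true);
        // The emitted record must pass the logger's level threshold for the
        // appender to write (and therefore flush) anything.
        logger.info("flushing logs");
        wa.setImmediateFlush(previous); // restore buffered behavior
      }
    }
  }
}
{code}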
[jira] [Created] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished
Jason Lowe created YARN-1189: Summary: NMTokenSecretManagerInNM is not being told when applications have finished Key: YARN-1189 URL: https://issues.apache.org/jira/browse/YARN-1189 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Jason Lowe The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
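For context, a minimal sketch of the kind of cleanup an appFinished hook would perform; the class is a simplified, hypothetical stand-in, though the field names mirror the ones mentioned in the description:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the secret manager's per-application bookkeeping.
public class AppTokenBookkeeping {
  private final Map<String, Object> oldMasterKeys = new HashMap<String, Object>();
  private final Map<String, Set<String>> appToAppAttemptMap =
      new HashMap<String, Set<String>>();

  // If this is never called when an application finishes, both maps grow
  // without bound, which is the leak described above.
  public synchronized void appFinished(String applicationId) {
    oldMasterKeys.remove(applicationId);
    appToAppAttemptMap.remove(applicationId);
  }
}
{code}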
[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765607#comment-13765607 ] Bikas Saha commented on YARN-1185: -- The expectation was that the data being written to each file is so small that it gets sent over completely in the first transfer. So partial writes would not occur. That's why the code accumulates the write buffer locally and then issues a single write to the FileSystem. FileSystemRMStateStore doesn't use temporary files when writing data Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However, if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently
[ https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765891#comment-13765891 ] Hadoop QA commented on YARN-1183: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602875/YARN-1183--n2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1905//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1905//console This message is automatically generated. MiniYARNCluster shutdown takes several minutes intermittently - Key: YARN-1183 URL: https://issues.apache.org/jira/browse/YARN-1183 Project: Hadoop YARN Issue Type: Bug Reporter: Andrey Klochkov Attachments: YARN-1183--n2.patch, YARN-1183.patch As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster Java processes living for several minutes after successful completion of the corresponding test. There is a concurrency issue in MiniYARNCluster shutdown logic which leads to this. Sometimes RM stops before an app master sends its last report, and then the app master keeps retrying for 6 minutes. In some cases it leads to failures in subsequent tests, and it affects performance of tests as app masters eat resources. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'
[ https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1166: Attachment: YARN-1166.3.patch YARN 'appsFailed' metric should be of type 'counter' Key: YARN-1166 URL: https://issues.apache.org/jira/browse/YARN-1166 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Srimanth Gunturi Assignee: Akira AJISAKA Priority: Blocker Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of type 'gauge' - which means the exact value will be reported. All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) are of type 'counter' - meaning Ganglia will use slope to provide deltas between time-points. To be consistent, the AppsFailed metric should also be of type 'counter'. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
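For reference, a brief sketch using the Hadoop metrics2 annotations of the distinction being requested; the class and field set below are illustrative only, not the actual QueueMetrics source:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterInt;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

// Counters are monotonically increasing, so Ganglia can report slopes/deltas;
// gauges report the current raw value.
@Metrics(context = "yarn")
public class ExampleQueueMetrics {
  @Metric("# of apps submitted") MutableCounterInt appsSubmitted; // counter
  @Metric("# of apps failed")    MutableCounterInt appsFailed;    // proposed: counter
  @Metric("# of apps running")   MutableGaugeInt   appsRunning;   // gauge (point-in-time)
}
{code}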
[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator
[ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765758#comment-13765758 ] Hadoop QA commented on YARN-1021: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602846/YARN-1021.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-assemblies hadoop-tools/hadoop-sls hadoop-tools/hadoop-tools-dist. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1903//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1903//console This message is automatically generated. Yarn Scheduler Load Simulator - Key: YARN-1021 URL: https://issues.apache.org/jira/browse/YARN-1021 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf The Yarn Scheduler is a fertile area of interest with different implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations are also made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features, and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm very well before we deploy it in a production cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling algorithm. Evaluating in a real cluster is always time- and cost-consuming, and it is also very hard to find a large-enough cluster. Hence, a simulator which can predict how well a scheduler algorithm works for some specific workload would be quite useful. We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads in a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation. The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters via handling and dispatching NM/AM heartbeat events from within the same JVM. To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler. 
The simulator will produce real-time metrics while executing, including:
* Resource usage for the whole cluster and each queue, which can be utilized to configure cluster and queue capacity.
* The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual jobs' turnaround time, throughput, fairness, capacity guarantee, etc.).
* Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc.), which can be utilized by Hadoop developers to find code hot spots and scalability limits.
The simulator will provide real-time charts showing the behavior of the scheduler and its performance. A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
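As a rough illustration of the scheduler-wrapper idea mentioned in the description, a hypothetical decorator that times a single scheduler operation; the interface below is a stand-in, not the real YARN scheduler API:
{code}
// Hypothetical minimal interface standing in for the real scheduler API.
interface SimpleScheduler {
  void handleNodeHeartbeat(String nodeId);
}

// Wrapper that delegates to the real scheduler and records how long each
// operation takes, the kind of per-operation cost the simulator reports.
class TimingSchedulerWrapper implements SimpleScheduler {
  private final SimpleScheduler delegate;

  TimingSchedulerWrapper(SimpleScheduler delegate) {
    this.delegate = delegate;
  }

  @Override
  public void handleNodeHeartbeat(String nodeId) {
    long start = System.nanoTime();
    try {
      delegate.handleNodeHeartbeat(nodeId);
    } finally {
      long elapsedMicros = (System.nanoTime() - start) / 1000;
      System.out.println("handleNodeHeartbeat took " + elapsedMicros + " us");
    }
  }
}
{code}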
[jira] [Commented] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765730#comment-13765730 ] Hadoop QA commented on YARN-867: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602838/YARN-867.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1902//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1902//console This message is automatically generated. Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
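A hedged sketch of the isolation idea: catch unexpected RuntimeExceptions from an individual aux service handler so one misbehaving service cannot take down the shared dispatcher. The interface and class below are simplified stand-ins, not the NodeManager's actual AuxServices code:
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in for an auxiliary service and the event it handles.
interface AuxService {
  String getName();
  void initializeApplication(String applicationId, byte[] serviceData);
}

class IsolatingAuxDispatcher {
  private final List<AuxService> services = new CopyOnWriteArrayList<AuxService>();

  void register(AuxService service) {
    services.add(service);
  }

  // Deliver an APPLICATION_INIT-style event to every service, isolating
  // failures: a RuntimeException from one service is logged instead of
  // propagating and killing the dispatcher thread.
  void onApplicationInit(String applicationId, byte[] serviceData) {
    for (AuxService service : services) {
      try {
        service.initializeApplication(applicationId, serviceData);
      } catch (RuntimeException e) {
        System.err.println("Aux service " + service.getName()
            + " failed to handle APPLICATION_INIT for " + applicationId + ": " + e);
      }
    }
  }
}
{code}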
[jira] [Created] (YARN-1190) enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb
Siqi Li created YARN-1190: - Summary: enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb Key: YARN-1190 URL: https://issues.apache.org/jira/browse/YARN-1190 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Priority: Minor -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1190) enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb
[ https://issues.apache.org/jira/browse/YARN-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1190: -- Description: Since there is no reducer, the memory allocated to the reducer is irrelevant to enabling uber mode for a job enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb Key: YARN-1190 URL: https://issues.apache.org/jira/browse/YARN-1190 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Priority: Minor Since there is no reducer, the memory allocated to the reducer is irrelevant to enabling uber mode for a job -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
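To make the intent concrete, a hedged sketch of how the uber-mode memory check could skip the reducer requirement when the job has no reducers; the method and parameter names are illustrative, not the actual MapReduce code:
{code}
public class UberModeMemoryCheck {
  // Illustrative check: the reduce-memory constraint only applies when the
  // job actually has reducers.
  static boolean memoryFitsInAm(int numReduces, long mapMemoryMb,
      long reduceMemoryMb, long amMemoryMb) {
    boolean mapsFit = mapMemoryMb <= amMemoryMb;
    boolean reducesFit = numReduces == 0 || reduceMemoryMb <= amMemoryMb;
    return mapsFit && reducesFit;
  }
}
{code}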
[jira] [Created] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler
Akira AJISAKA created YARN-1188: --- Summary: The context of QueueMetrics becomes 'default' when using FairScheduler Key: YARN-1188 URL: https://issues.apache.org/jira/browse/YARN-1188 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Akira AJISAKA Priority: Minor Fix For: 2.3.0, 3.0.0 I found the context of QueueMetrics changed from 'yarn' to 'default' when I was using FairScheduler. The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics like below:
{code}
+@Metrics(context="yarn")
 public class FSQueueMetrics extends QueueMetrics {
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1191: Attachment: YARN-1191-1.patch Attaching patch. Thanks, Mayank [YARN-321] Compilation is failing for YARN-321 branch - Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1191-1.patch Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch
Mayank Bansal created YARN-1191: --- Summary: [YARN-321] Compilation is failing for YARN-321 branch Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Like YARN-978, we need some client-oriented class to expose the container history info. Neither Container nor RMContainer is the right one. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1191: Description: Compilation is failing for YARN-321 branch was:Like YARN-978, we need some client-oriented class to expose the container history info. Neither Container nor RMContainer is the right one. [YARN-321] Compilation is failing for YARN-321 branch - Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766005#comment-13766005 ] Hadoop QA commented on YARN-1191: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602903/YARN-1191-1.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1906//console This message is automatically generated. [YARN-321] Compilation is failing for YARN-321 branch - Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1191-1.patch Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch
[ https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal reassigned YARN-1191: --- Assignee: Mayank Bansal (was: Zhijie Shen) [YARN-321] Compilation is failing for YARN-321 branch - Key: YARN-1191 URL: https://issues.apache.org/jira/browse/YARN-1191 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Compilation is failing for YARN-321 branch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-1157) ResourceManager UI has invalid tracking URL link for distributed shell application
[ https://issues.apache.org/jira/browse/YARN-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-1157: --- Assignee: Xuan Gong ResourceManager UI has invalid tracking URL link for distributed shell application -- Key: YARN-1157 URL: https://issues.apache.org/jira/browse/YARN-1157 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Tassapol Athiapinya Assignee: Xuan Gong Fix For: 2.1.1-beta Submit a YARN distributed shell application. Go to the ResourceManager Web UI. The application definitely appears. In the Tracking UI column, there will be a history link. Click on that link. Instead of showing the application master web UI, an HTTP error 500 would appear. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766028#comment-13766028 ] Mayank Bansal commented on YARN-978: If we can add the tracking URL, that would be useful in the UI. Thanks, Mayank [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation -- Key: YARN-978 URL: https://issues.apache.org/jira/browse/YARN-978 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Xuan Gong Fix For: YARN-321 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch We don't have ApplicationAttemptReport and Protobuf implementation. Adding that. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-540: - Attachment: YARN-540.7.patch Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766032#comment-13766032 ] Jian He commented on YARN-540: -- Thanks for the detailed comments, uploaded a new patch. bq. there is a member variable in AMRMClient that is used to get the ping interval from config Turns out we have one in AMRMClientAsync, but not in AMRMClientImpl. bq. Is delete not throwing an exception for non-existent location? Delete throws an exception if the location does not exist. bq. Can RMAppEventType.ATTEMPT_FAILED be received when in REMOVING state (and also when in FINISHING state)? Once we have moved to the REMOVING/FINISHING state, the attempt has already gone to FINISHING, so it should not be possible to generate an RMAppEventType.ATTEMPT_FAILED event in that state. bq. What is the YARNApplicationState enum corresponding to AppState.REMOVING? In case of REMOVING, we return YarnApplicationState as RUNNING; does that make sense? Addressed the other comments as well. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
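A small sketch of the external state mapping discussed in the last point, where the internal REMOVING (and FINISHING) states are still reported to clients as RUNNING; the enums below are simplified stand-ins for RMAppState and YarnApplicationState:
{code}
// Simplified stand-ins for the internal and client-facing state enums.
enum InternalAppState { RUNNING, REMOVING, FINISHING, FINISHED }
enum ClientAppState { RUNNING, FINISHED }

public class AppStateMapping {
  // REMOVING and FINISHING are internal bookkeeping states; from a client's
  // point of view the application is still running until it is FINISHED.
  static ClientAppState toClientState(InternalAppState state) {
    switch (state) {
      case RUNNING:
      case REMOVING:
      case FINISHING:
        return ClientAppState.RUNNING;
      case FINISHED:
        return ClientAppState.FINISHED;
      default:
        throw new IllegalArgumentException("Unknown state: " + state);
    }
  }
}
{code}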
[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766040#comment-13766040 ] Hadoop QA commented on YARN-540: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12602910/YARN-540.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1907//console This message is automatically generated. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, YARN-540.patch, YARN-540.patch When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler
[ https://issues.apache.org/jira/browse/YARN-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1188: Target Version/s: 2.3.0 The context of QueueMetrics becomes 'default' when using FairScheduler -- Key: YARN-1188 URL: https://issues.apache.org/jira/browse/YARN-1188 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Akira AJISAKA Priority: Minor Labels: metrics, newbie I found the context of QueueMetrics changed from 'yarn' to 'default' when I was using FairScheduler. The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics like below:
{code}
+@Metrics(context="yarn")
 public class FSQueueMetrics extends QueueMetrics {
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler
[ https://issues.apache.org/jira/browse/YARN-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1188: Fix Version/s: (was: 2.3.0) (was: 3.0.0) The context of QueueMetrics becomes 'default' when using FairScheduler -- Key: YARN-1188 URL: https://issues.apache.org/jira/browse/YARN-1188 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.0-beta Reporter: Akira AJISAKA Priority: Minor Labels: metrics, newbie I found the context of QueueMetrics changed from 'yarn' to 'default' when I was using FairScheduler. The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics like below:
{code}
+@Metrics(context="yarn")
 public class FSQueueMetrics extends QueueMetrics {
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira