[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765187#comment-13765187
 ] 

Bikas Saha commented on YARN-540:
-


I think there is a member variable in AMRMClient that is used to get the ping 
interval from config. We could use that instead of the hardcoded 100. Sorry for 
not mentioning it earlier.
{code}
+LOG.info("Waiting for application to be successfully unregistered.");
+Thread.sleep(100);
{code}
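For illustration only, a minimal sketch of that suggestion, assuming the existing AMRMClient member is exposed as a plain long (the real field and config key names may differ):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class UnregisterWaitSketch {
  private static final Log LOG = LogFactory.getLog(UnregisterWaitSketch.class);

  /** Placeholder for the existing "is the app safe to unregister yet?" check. */
  public interface UnregisterCheck {
    boolean isUnregistered();
  }

  // heartbeatIntervalMs stands in for the AMRMClient member that is already
  // populated from configuration; it replaces the hard-coded 100 ms below.
  public static void waitForUnregister(UnregisterCheck check, long heartbeatIntervalMs)
      throws InterruptedException {
    while (!check.isUnregistered()) {
      LOG.info("Waiting for application to be successfully unregistered.");
      Thread.sleep(heartbeatIntervalMs);
    }
  }
}
{code}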

Can we rename isAppRemovedFromStateStore() to isAppSafeToUnregister()? Then we 
can move the check for unmanagedAM within that method. This way we won't leak 
unmanagedAM outside RMAppImpl.
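A rough sketch of what that rename could look like; the field names follow this discussion rather than the actual RMAppImpl members:
{code}
// Inside RMAppImpl (field names are illustrative, not the real ones):
private boolean unmanagedAM;
private volatile boolean removedFromStateStore;

/**
 * Renamed from isAppRemovedFromStateStore(): callers only need to know whether
 * unregistration may proceed, so the unmanaged-AM special case stays inside
 * RMAppImpl instead of leaking to callers such as the client wait loop.
 */
public boolean isAppSafeToUnregister() {
  return unmanagedAM || removedFromStateStore;
}
{code}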

This transition is invalid and should not be ignored. It's a bug if it happens.
{code}
+// ignorable transitions
+.addTransition(RMAppState.REMOVING, RMAppState.REMOVING,
+RMAppEventType.ATTEMPT_UNREGISTERED)
{code}

Shouldn't the app.isAppRemovalRequestSent flag be checked here, since this will 
typically happen after unregister has already removed the app? How is this 
working on a single node cluster? Is delete not throwing an exception for a 
non-existent location?
{code}
+  // application completely done and remove from state store.
+  // App state may be already removed during RMAppFinishingOrRemovingTransition.
+  RMStateStore store = app.rmContext.getStateStore();
+  store.removeApplication(app);
{code}
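A hedged sketch of the guard being asked for; the flag name follows the comment above and the surrounding patch code is paraphrased, not copied:
{code}
// Only remove the app if the unregister path has not already sent a removal
// request; otherwise this second remove may fail (or silently no-op) depending
// on the state store implementation.
RMStateStore store = app.rmContext.getStateStore();
if (!app.isAppRemovalRequestSent) {
  store.removeApplication(app);
}
{code}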

What is the YARNApplicationState enum corresponding to AppState.REMOVING?

Is MockRMApp never expected to get removed from the store? I would have 
expected this to return true.
{code}
+  @Override
+  public boolean isAppRemovedFromStateStore() {
+return false;
{code}

Can RMAppEventType.ATTEMPT_FAILED be received when in REMOVING state (and also 
when in FINISHING state)?



 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.patch, 
 YARN-540.patch


 When a job succeeds and successfully calls finishApplicationMaster, and the RM is 
 shut down and restarted, the dispatcher may be stopped before it can process the 
 REMOVE_APP event. The next time the RM comes back, it will reload the existing 
 state files even though the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765329#comment-13765329
 ] 

Hudson commented on YARN-1176:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #330 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/330/])
YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include 
unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1522062)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java


 RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
 --

 Key: YARN-1176
 URL: https://issues.apache.org/jira/browse/YARN-1176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta
Reporter: Thomas Graves
Assignee: Jonathan Eagles
Priority: Critical
 Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta

 Attachments: YARN-1176.patch


 In the web services api for the cluster/metrics, the totalNodes reported 
 doesn't include the unhealthy nodes.
 this.totalNodes = activeNodes + lostNodes + decommissionedNodes
   + rebootedNodes;
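 For reference, a sketch of the corrected sum described above, assuming an unhealthyNodes counter is available alongside the other ClusterMetricsInfo fields:
{code}
// Count unhealthy nodes in the total as well (illustrative; field names follow
// the snippet quoted above).
this.totalNodes = activeNodes + lostNodes + decommissionedNodes
    + rebootedNodes + unhealthyNodes;
{code}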

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-1184) ClassCastException is thrown during preemption when a huge job is submitted to queue B whose resources are used by a job in queue A

2013-09-12 Thread J.Andreina (JIRA)
J.Andreina created YARN-1184:


 Summary: ClassCastException is thrown during preemption when a 
huge job is submitted to queue B whose resources are used by a job in queue A
 Key: YARN-1184
 URL: https://issues.apache.org/jira/browse/YARN-1184
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: J.Andreina


preemption is enabled.
Queue = a,b
a capacity = 30%
b capacity = 70%

Step 1: Assign a big job to queue a (so that job_a will utilize some resources 
from queue b)
Step 2: Assign a big job to queue b.

The following exception is thrown at the Resource Manager:
{noformat}
2013-09-12 10:42:32,535 ERROR [SchedulingMonitor 
(ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler 
(YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw 
an Exception.
java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be 
cast to java.util.NavigableSet
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
at java.lang.Thread.run(Thread.java:662)

{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765352#comment-13765352
 ] 

Hudson commented on YARN-1176:
--

FAILURE: Integrated in Hadoop-Hdfs-0.23-Build #728 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/728/])
YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include 
unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1522069)
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java
* 
/hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java


 RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
 --

 Key: YARN-1176
 URL: https://issues.apache.org/jira/browse/YARN-1176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta
Reporter: Thomas Graves
Assignee: Jonathan Eagles
Priority: Critical
 Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta

 Attachments: YARN-1176.patch


 In the web services api for the cluster/metrics, the totalNodes reported 
 doesn't include the unhealthy nodes.
 this.totalNodes = activeNodes + lostNodes + decommissionedNodes
   + rebootedNodes;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765417#comment-13765417
 ] 

Hudson commented on YARN-1176:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1520 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1520/])
YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include 
unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1522062)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java


 RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
 --

 Key: YARN-1176
 URL: https://issues.apache.org/jira/browse/YARN-1176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta
Reporter: Thomas Graves
Assignee: Jonathan Eagles
Priority: Critical
 Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta

 Attachments: YARN-1176.patch


 In the web services api for the cluster/metrics, the totalNodes reported 
 doesn't include the unhealthy nodes.
 this.totalNodes = activeNodes + lostNodes + decommissionedNodes
   + rebootedNodes;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1176) RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765444#comment-13765444
 ] 

Hudson commented on YARN-1176:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1546 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1546/])
YARN-1176. RM web services ClusterMetricsInfo total nodes doesn't include 
unhealthy nodes (Jonathan Eagles via tgraves) (tgraves: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1522062)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ClusterMetricsInfo.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServices.java


 RM web services ClusterMetricsInfo total nodes doesn't include unhealthy nodes
 --

 Key: YARN-1176
 URL: https://issues.apache.org/jira/browse/YARN-1176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0, 0.23.9, 2.1.1-beta
Reporter: Thomas Graves
Assignee: Jonathan Eagles
Priority: Critical
 Fix For: 3.0.0, 2.3.0, 0.23.10, 2.1.1-beta

 Attachments: YARN-1176.patch


 In the web services api for the cluster/metrics, the totalNodes reported 
 doesn't include the unhealthy nodes.
 this.totalNodes = activeNodes + lostNodes + decommissionedNodes
   + rebootedNodes;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765538#comment-13765538
 ] 

Karthik Kambatla commented on YARN-1183:


Patch looks mostly good. A few comments:
# The ConcurrentHashMap should be initialized with a smaller value for 
concurrencyLevel (2?), and maybe a smaller initialCapacity too (see the sketch 
after this list).
# For the keys in the map, we should probably use ApplicationId instead of 
trackingUrl, as the trackingUrl can change by the time the app completes.
# Rename waitAppMastersToFinish to waitForAppMastersToFinish?
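A small illustrative sketch of points 1 and 2; the value type and field names are placeholders, since the actual map contents come from the patch:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;

// Key by ApplicationId rather than trackingUrl, and size the map modestly.
// Boolean is a stand-in for whatever per-AM state the patch actually tracks.
class AppMasterTrackerSketch {
  private final ConcurrentMap<ApplicationId, Boolean> unfinishedAppMasters =
      new ConcurrentHashMap<ApplicationId, Boolean>(
          16,     // initialCapacity: only a handful of AMs in a mini cluster
          0.75f,  // default load factor
          2);     // concurrencyLevel: few concurrent writers expected
}
{code}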

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows

2013-09-12 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-1078:


 Target Version/s: 3.0.0, 2.1.1-beta
Affects Version/s: (was: 2.3.0)
   2.1.1-beta

 TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater 
 fail on Windows
 -

 Key: YARN-1078
 URL: https://issues.apache.org/jira/browse/YARN-1078
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor
 Attachments: YARN-1078.2.patch, YARN-1078.3.patch, 
 YARN-1078.branch-2.patch, YARN-1078.patch


 The three unit tests fail on Windows due to host name resolution differences 
 on Windows, i.e. 127.0.0.1 does not resolve to host name localhost.
 {noformat}
 org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container 
 container_0__01_00 identifier is not valid for current Node manager. 
 Expected : 127.0.0.1:12345 Found : localhost:12345
 {noformat}
 {noformat}
 testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater)
   Time elapsed: 8343 sec   FAILURE!
 org.junit.ComparisonFailure: expected:<[localhost]:12345> but 
 was:<[127.0.0.1]:12345>
   at org.junit.Assert.assertEquals(Assert.java:125)
   at org.junit.Assert.assertEquals(Assert.java:147)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at $Proxy26.registerNodeManager(Unknown Source)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data

2013-09-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765649#comment-13765649
 ] 

Bikas Saha commented on YARN-1185:
--

Yes, since the FileSystem interface does not provide any atomic operations.
The RM will not start if there is anything wrong with the stored state. So if 
some write is partial/empty it will not start. At that point we can judge 
whether the missing piece is important or not, and purge that piece and continue. 
This should be ok for job related data since we only lose a job. However, for 
global data like secret keys we may have to be more careful. In one case we 
encode the info in the file name. In other cases, where the data cannot be 
encoded in the file name, we may have to ensure that the store operation is not 
partial/empty. For HDFS we may assume atomic rename, but will that be true for 
all filesystems?

So we could do the following. 
Storing app data may continue to be optimistic, and since that's the main 
workload we continue to do what we do today.
Storing global data (mainly the security stuff) can change to be more atomic.

We can make all store operations more atomic if we feel that we will not slow 
down the RM because of multiple roundtrips to the store.
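To make the "more atomic" option concrete, a minimal write-temp-then-rename sketch using the generic FileSystem API; the actual FileSystemRMStateStore paths and helpers differ:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hedged sketch of the write-to-temp-then-rename pattern discussed above. */
public class AtomicWriteSketch {
  public static void writeAtomically(FileSystem fs, Path dest, byte[] data)
      throws IOException {
    Path tmp = new Path(dest.getParent(), dest.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true);  // overwrite any stale temp file
    try {
      out.write(data);
    } finally {
      out.close();
    }
    // Relies on rename being atomic on the underlying store (true for HDFS,
    // not guaranteed for every FileSystem implementation, as noted above).
    if (!fs.rename(tmp, dest)) {
      throw new IOException("Failed to rename " + tmp + " to " + dest);
    }
  }
}
{code}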


 FileSystemRMStateStore doesn't use temporary files when writing data
 

 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe

 FileSystemRMStateStore writes directly to the destination file when storing 
 state. However if the RM were to crash in the middle of the write, the 
 recovery method could encounter a partially-written file and either outright 
 crash during recovery or silently load incomplete state.
 To avoid this, the data should be written to a temporary file and renamed to 
 the destination file afterwards.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator

2013-09-12 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1021:
--

Attachment: YARN-1021.patch

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with reasonable amount of confidence, there-by aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep tracking of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator

2013-09-12 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1021:
--

Attachment: (was: YARN-1021.pdf)

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with reasonable amount of confidence, there-by aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep tracking of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1149) NM throws InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING

2013-09-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765722#comment-13765722
 ] 

Xuan Gong commented on YARN-1149:
-

bq. It creates additional overhead, as the events will be sent every 1s. As the 
new requests will be blocked, what we need to prevent is that while 
stopService() is running, startContainers() shouldn't be halfway through, where 
the application is created but not yet added into the context. Then, if 
unfortunately a CMgrCompletedAppsEvent is sent at this time, the newly created 
application will not be notified. Maybe we can use synchronization to prevent 
stopService() and startContainers() from being interleaved.

Yes, it does create some additional overhead. Maybe we can make the waiting 
time a little bit longer. As for using synchronization to prevent stopService() 
and startContainers() from being interleaved, I am thinking that we may have 
a race condition problem. I think that the ContainerManager should start 
serviceStop() immediately without waiting. Imagine that we have several 
startContainers() calls and one serviceStop() call; we cannot guarantee when 
serviceStop() will be executed. It is very possible that it would need to wait 
for all the startContainers() requests to finish (see the sketch below).
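For illustration, a minimal sketch of the synchronization approach under discussion; it also shows the concern raised above, since stop() blocks until every in-flight startContainer() finishes. Names are illustrative, not the real ContainerManagerImpl members:
{code}
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class StartStopGuardSketch {
  private final ReadWriteLock guard = new ReentrantReadWriteLock();

  void startContainer() {
    guard.readLock().lock();
    try {
      // create the application and add it to the context atomically w.r.t. stop()
    } finally {
      guard.readLock().unlock();
    }
  }

  void stop() {
    // Blocks until all in-flight startContainer() calls release their read
    // locks, which is exactly the waiting being questioned above.
    guard.writeLock().lock();
    try {
      // notify every application in the context (CMgrCompletedAppsEvent)
    } finally {
      guard.writeLock().unlock();
    }
  }
}
{code}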

 NM throws InvalidStateTransitonException: Invalid event: 
 APPLICATION_LOG_HANDLING_FINISHED at RUNNING
 -

 Key: YARN-1149
 URL: https://issues.apache.org/jira/browse/YARN-1149
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ramya Sunil
Assignee: Xuan Gong
 Fix For: 2.1.1-beta

 Attachments: YARN-1149.1.patch, YARN-1149.2.patch, YARN-1149.3.patch, 
 YARN-1149.4.patch


 When nodemanager receives a kill signal when an application has finished 
 execution but log aggregation has not kicked in, 
 InvalidStateTransitonException: Invalid event: 
 APPLICATION_LOG_HANDLING_FINISHED at RUNNING is thrown
 {noformat}
 2013-08-25 20:45:00,875 INFO  logaggregation.AppLogAggregatorImpl 
 (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just 
 finished : application_1377459190746_0118
 2013-08-25 20:45:00,876 INFO  logaggregation.AppLogAggregatorImpl 
 (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate 
 log-file for app application_1377459190746_0118 at 
 /app-logs/foo/logs/application_1377459190746_0118/host_45454.tmp
 2013-08-25 20:45:00,876 INFO  logaggregation.LogAggregationService 
 (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation 
 to complete for application_1377459190746_0118
 2013-08-25 20:45:00,891 INFO  logaggregation.AppLogAggregatorImpl 
 (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for 
 container container_1377459190746_0118_01_04. Current good log dirs are 
 /tmp/yarn/local
 2013-08-25 20:45:00,915 INFO  logaggregation.AppLogAggregatorImpl 
 (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate 
 log-file for app application_1377459190746_0118
 2013-08-25 20:45:00,925 WARN  application.Application 
 (ApplicationImpl.java:handle(427)) - Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 APPLICATION_LOG_HANDLING_FINISHED at RUNNING
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
  
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)   
 at java.lang.Thread.run(Thread.java:662)
 2013-08-25 20:45:00,926 INFO  application.Application 
 (ApplicationImpl.java:handle(430)) - Application 
 application_1377459190746_0118 transitioned from RUNNING to null
 2013-08-25 20:45:00,927 WARN  monitor.ContainersMonitorImpl 
 (ContainersMonitorImpl.java:run(463)) - 
 

[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'

2013-09-12 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765709#comment-13765709
 ] 

Akira AJISAKA commented on YARN-1166:
-

[~vinodkv], thanks for your comment. 
I understand, but I don't know how to find the 'final' attempt on which to 
increment the 'appsFailed' counter.
Could you tell me how to find it?

I attached a patch that decrements the counter by incrementing it by -1; this 
is a quick workaround.

 YARN 'appsFailed' metric should be of type 'counter'
 

 Key: YARN-1166
 URL: https://issues.apache.org/jira/browse/YARN-1166
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Srimanth Gunturi
Assignee: Akira AJISAKA
Priority: Blocker
 Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch


 Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of 
 type 'gauge' - which means the exact value will be reported. 
 All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) 
 are all of type 'counter' - meaning Ganglia will use slope to provide deltas 
 between time-points.
 To be consistent, AppsFailed metric should also be of type 'counter'. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-1021) Yarn Scheduler Load Simulator

2013-09-12 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1021:
--

Attachment: YARN-1021.pdf

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with reasonable amount of confidence, there-by aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep tracking of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1078) TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater fail on Windows

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765562#comment-13765562
 ] 

Hudson commented on YARN-1078:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4404 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4404/])
YARN-1078. TestNodeManagerResync, TestNodeManagerShutdown, and 
TestNodeStatusUpdater fail on Windows. Contributed by Chuan Liu. (cnauroth: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1522644)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeManagerShutdown.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java


 TestNodeManagerResync, TestNodeManagerShutdown, and TestNodeStatusUpdater 
 fail on Windows
 -

 Key: YARN-1078
 URL: https://issues.apache.org/jira/browse/YARN-1078
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.3.0
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor
 Attachments: YARN-1078.2.patch, YARN-1078.3.patch, 
 YARN-1078.branch-2.patch, YARN-1078.patch


 The three unit tests fail on Windows due to host name resolution differences 
 on Windows, i.e. 127.0.0.1 does not resolve to host name localhost.
 {noformat}
 org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container 
 container_0__01_00 identifier is not valid for current Node manager. 
 Expected : 127.0.0.1:12345 Found : localhost:12345
 {noformat}
 {noformat}
 testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater)
   Time elapsed: 8343 sec   FAILURE!
 org.junit.ComparisonFailure: expected:<[localhost]:12345> but 
 was:<[127.0.0.1]:12345>
   at org.junit.Assert.assertEquals(Assert.java:125)
   at org.junit.Assert.assertEquals(Assert.java:147)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at $Proxy26.registerNodeManager(Unknown Source)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985)
 {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-867) Isolation of failures in aux services

2013-09-12 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-867:
---

Attachment: YARN-867.4.patch

 Isolation of failures in aux services 
 --

 Key: YARN-867
 URL: https://issues.apache.org/jira/browse/YARN-867
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Xuan Gong
Priority: Critical
 Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
 YARN-867.4.patch, YARN-867.sampleCode.2.patch


 Today, a malicious application can bring down the NM by sending bad data to a 
 service. For example, sending data to the ShuffleService such that it results in 
 any non-IOException will cause the NM's async dispatcher to exit, as the 
 service's INIT APP event is not handled properly. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696
 ] 

Xuan Gong commented on YARN-867:


The new patch added more transitions in ContainerState.EXITED_WITH_FAILURE and 
ContainerState.DONE. This patch still handles 
AuxServicesEventType.APPLICATION_INIT and handles exceptions at the container 
level. 

I thought about moving AuxServicesEventType.APPLICATION_INIT into the application, 
but I do not think we would get any benefit. The reasons are:
1. There are two new events, AuxServicesEvent.CONTAINER_INIT and 
AuxServicesEvent.CONTAINER_STOP, and we need to handle them at the container level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application, we 
will have two options:
   a. We do not start any containers until all the AuxServices finish their 
APPLICATION_INIT. If we choose this, it definitely simplifies the problem: when 
there is any exception from APPLICATION_INIT in the AuxServices, just kill 
the application. But does it make sense to block all the containers?
   b. We let the AuxServices do APPLICATION_INIT while containers start at the 
same time. In that case we end up in the same situation as now, because 
when the container receives the CONTAINER_EXITED_WITH_FAILURE event we cannot 
guarantee which state the container is in; it may be at the KILLING state, the 
LOCALIZED state, etc. Any state is possible.


 Isolation of failures in aux services 
 --

 Key: YARN-867
 URL: https://issues.apache.org/jira/browse/YARN-867
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Xuan Gong
Priority: Critical
 Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
 YARN-867.4.patch, YARN-867.sampleCode.2.patch


 Today, a malicious application can bring down the NM by sending bad data to a 
 service. For example, sending data to the ShuffleService such that it results in 
 any non-IOException will cause the NM's async dispatcher to exit, as the 
 service's INIT APP event is not handled properly. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes

2013-09-12 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765760#comment-13765760
 ] 

Luke Lu commented on YARN-311:
--

I think the v6.2 patch looks reasonable. If there are no further objections, I 
plan to commit it over the weekend. 

 Dynamic node resource configuration: core scheduler changes
 ---

 Key: YARN-311
 URL: https://issues.apache.org/jira/browse/YARN-311
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, scheduler
Reporter: Junping Du
Assignee: Junping Du
 Attachments: YARN-311-v1.patch, YARN-311-v2.patch, YARN-311-v3.patch, 
 YARN-311-v4.patch, YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, 
 YARN-311-v6.2.patch, YARN-311-v6.patch


 As the first step, we go for resource changes on the RM side and expose admin APIs 
 (admin protocol, CLI, REST and JMX API) later. This JIRA will only contain 
 changes in the scheduler. 
 The flow to update a node's resource and make resource scheduling aware of it is: 
 1. A resource update comes through the admin API to the RM and takes effect on RMNodeImpl.
 2. When the next NM heartbeat for updating status comes, the RMNode's resource 
 change is picked up and the delta resource is added to the SchedulerNode's 
 availableResource before actual scheduling happens.
 3. The scheduler does resource allocation according to the new availableResource in 
 SchedulerNode.
 For more design details, please refer to the proposal and discussions in the parent 
 JIRA: YARN-291.
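 For illustration only, a hedged sketch of step 2 using the Resources helper; the actual hook points in RMNodeImpl/SchedulerNode may differ:
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// When the heartbeat notices the RMNode's total resource changed, fold the
// delta into the SchedulerNode's available resource before scheduling.
// Class and method names here are illustrative, not the patch's actual hooks.
class NodeResourceUpdateSketch {
  void applyDelta(Resource oldTotal, Resource newTotal, Resource availableResource) {
    Resource delta = Resources.subtract(newTotal, oldTotal); // may be negative
    Resources.addTo(availableResource, delta);               // adjust before scheduling
  }
}
{code}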

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'

2013-09-12 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765764#comment-13765764
 ] 

Akira AJISAKA commented on YARN-1166:
-

[~srimanth.gunturi], this fix seems simple. I've already uploaded 
'YARN-1166.2.patch' for this.
However, if the first ApplicationAttempt fails and the second 
ApplicationAttempt succeeds, the counter is incremented with this patch.
Do you think that is the correct behavior?

 YARN 'appsFailed' metric should be of type 'counter'
 

 Key: YARN-1166
 URL: https://issues.apache.org/jira/browse/YARN-1166
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Srimanth Gunturi
Assignee: Akira AJISAKA
Priority: Blocker
 Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch


 Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of 
 type 'gauge' - which means the exact value will be reported. 
 All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) 
 are all of type 'counter' - meaning Ganglia will use slope to provide deltas 
 between time-points.
 To be consistent, AppsFailed metric should also be of type 'counter'. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data

2013-09-12 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765622#comment-13765622
 ] 

Jason Lowe commented on YARN-1185:
--

Also, couldn't it be left with zero-length files if it dies after create but 
before write can occur?

 FileSystemRMStateStore doesn't use temporary files when writing data
 

 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe

 FileSystemRMStateStore writes directly to the destination file when storing 
 state. However if the RM were to crash in the middle of the write, the 
 recovery method could encounter a partially-written file and either outright 
 crash during recovery or silently load incomplete state.
 To avoid this, the data should be written to a temporary file and renamed to 
 the destination file afterwards.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data

2013-09-12 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765620#comment-13765620
 ] 

Jason Lowe commented on YARN-1185:
--

Ah I see.  That's an assumption based on a specific FileSystem implementation 
-- does it only work with HDFS?

 FileSystemRMStateStore doesn't use temporary files when writing data
 

 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe

 FileSystemRMStateStore writes directly to the destination file when storing 
 state. However if the RM were to crash in the middle of the write, the 
 recovery method could encounter a partially-written file and either outright 
 crash during recovery or silently load incomplete state.
 To avoid this, the data should be written to a temporary file and renamed to 
 the destination file afterwards.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'

2013-09-12 Thread Srimanth Gunturi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765716#comment-13765716
 ] 

Srimanth Gunturi commented on YARN-1166:


Isn't the fix as simple as changing the {{appsFailed}} type to 
{{MutableCounterInt}} and removing the below?
{code}
else {
   appsFailed.decr();
}
{code}

This would make appsFailed cumulative, as in {{finishApp(AppSchedulingInfo, 
RMAppAttemptState)}} we do the {{appsFailed.incr()}} on the final attempt state?
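For illustration, a hedged sketch of that shape; the field and method names only approximate QueueMetrics and are not the committed patch:
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.lib.MutableCounterInt;

class QueueMetricsSketch {
  // A counter only ever increments, so Ganglia can use the slope for deltas.
  @Metric("# of apps failed") MutableCounterInt appsFailed;

  void onAppFinished(boolean finalAttemptFailed) {
    if (finalAttemptFailed) {
      appsFailed.incr();   // no decr() when a later attempt succeeds
    }
  }
}
{code}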

 YARN 'appsFailed' metric should be of type 'counter'
 

 Key: YARN-1166
 URL: https://issues.apache.org/jira/browse/YARN-1166
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Srimanth Gunturi
Assignee: Akira AJISAKA
Priority: Blocker
 Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch


 Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of 
 type 'gauge' - which means the exact value will be reported. 
 All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) 
 are all of type 'counter' - meaning Ganglia will use slope to provide deltas 
 between time-points.
 To be consistent, AppsFailed metric should also be of type 'counter'. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-09-12 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765772#comment-13765772
 ] 

Wei Yan commented on YARN-1021:
---

Uploaded a new patch according to [~curino]'s comments.
Main updates:
(1) Let the simulator consume Rumen traces directly, using the Rumen reader 
(JobStory). The document is also updated.
(2) Added a test case that starts the simulator with an example trace, runs it 
for 2 minutes, and stops.
(3) Fixed several minor issues (LICENSE/NOTICE, javadoc, string appends, etc.).

For simulating slowstart, headroom, and some other behaviors, opened JIRA 
YARN-1186.
For discrete event-based simulation, opened JIRA YARN-1187. 

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workload. Each scheduler algorithm has its own set of 
 features, and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantee, resource availability, etc. It is very important to 
 evaluate a scheduler algorithm very well before we deploy it in a production 
 cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
 algorithm. Evaluating in a real cluster is always time and cost consuming, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 which can predict how well a scheduler algorithm for some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads in a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with reasonable amount of confidence, there-by aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AMs heartbeat events from within the same JVM.
 To keep tracking of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real time metrics while executing, including:
 * Resource usages for whole cluster and each queue, which can be utilized to 
 configure cluster and queue's capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the  scheduler behavior 
 (individual jobs turn around time, throughput, fairness, capacity guarantee, 
 etc).
 * Several key metrics of scheduler algorithm, such as time cost of each 
 scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
 developers to find the code spots and scalability limits.
 The simulator will provide real time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use simulator to simulate Fair Scheduler and Capacity Scheduler.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-12 Thread Andrey Klochkov (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765735#comment-13765735
 ] 

Andrey Klochkov commented on YARN-1183:
---

Yep, you're right about trackingUrl being able to change. Please disregard this 
part.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183.patch


 As described in MAPREDUCE-5501 sometimes M/R tests leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases it leads to failures in subsequent tests, and it affects performance of 
 tests as app masters eat resources.



[jira] [Created] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data

2013-09-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-1185:


 Summary: FileSystemRMStateStore doesn't use temporary files when 
writing data
 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe


FileSystemRMStateStore writes directly to the destination file when storing 
state. However, if the RM were to crash in the middle of the write, the recovery 
method could encounter a partially-written file and either outright crash 
during recovery or silently load incomplete state.

To avoid this, the data should be written to a temporary file and renamed to 
the destination file afterwards.
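
For illustration, a minimal sketch of the temp-file-then-rename pattern being suggested, using the generic Hadoop FileSystem API (this is not the FileSystemRMStateStore code; the helper name and the ".tmp" suffix are assumptions):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: write the state to a temporary file first, then rename it into
// place so recovery never observes a partially-written destination file.
public class AtomicWriteSketch {
  public static void writeAtomically(FileSystem fs, Path dest, byte[] data)
      throws IOException {
    Path tmp = new Path(dest.getParent(), dest.getName() + ".tmp");
    FSDataOutputStream out = fs.create(tmp, true); // overwrite any stale temp file
    try {
      out.write(data); // the whole record in one write
    } finally {
      out.close();
    }
    if (!fs.rename(tmp, dest)) { // publish only a completely-written file
      throw new IOException("Failed to rename " + tmp + " to " + dest);
    }
  }
}
{code}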



[jira] [Commented] (YARN-1130) Improve the log flushing for tasks when mapred.userlog.limit.kb is set

2013-09-12 Thread Paul Han (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765870#comment-13765870
 ] 

Paul Han commented on YARN-1130:


The patch is submitted. 
A few notes here:
# Added a LogUtils class to facilitate triggering a log flush, since Log4J's 
interfaces such as LogManager do not provide a flush() method.
# Modified the Task class to trigger the flushing of logs before it sends the TASK_DONE 
event to the MR AppMaster. In some cases, TASK_DONE may trigger the container to be 
killed before all logs have been written to disk.
# ContainerLogAppender supports flushing the log when a special message is 
received. The log flush is done synchronously with a timeout. This 
ensures the invoker of the flush waits until logs are written to disk or the 
timeout expires (a minimal sketch of this hand-off follows below).
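
As a rough illustration of the synchronous flush-with-timeout hand-off in the last note (this is not the actual patch; the class name is invented):
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the invoker of the flush blocks until the appender reports
// that buffered log entries hit the disk, or until the timeout expires.
public class FlushRequest {
  private final CountDownLatch done = new CountDownLatch(1);

  // Called by the appender once the buffered log entries have been written to disk.
  public void markFlushed() {
    done.countDown();
  }

  // Called by the task: returns true if the flush completed within the timeout.
  public boolean awaitFlush(long timeoutMillis) throws InterruptedException {
    return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
  }
}
{code}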

 Improve the log flushing for tasks when mapred.userlog.limit.kb is set
 --

 Key: YARN-1130
 URL: https://issues.apache.org/jira/browse/YARN-1130
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Paul Han
 Fix For: 2.0.5-alpha

 Attachments: YARN-1130.patch


 When the userlog limit is set with something like this:
 {code}
 <property>
   <name>mapred.userlog.limit.kb</name>
   <value>2048</value>
   <description>The maximum size of user-logs of each task in KB. 0 disables the
   cap.</description>
 </property>
 {code}
 the log entries will be truncated randomly for the jobs.
 The log size is left between 1.2MB and 1.6MB.
 Since the log is already limited, avoiding log truncation is crucial for 
 users.
 The other issue with the current 
 impl (org.apache.hadoop.yarn.ContainerLogAppender) is that log entries will 
 not be flushed to the file until the container shuts down and the log manager closes all 
 appenders. If a user wants to see the log during task execution, this is not 
 supported.
 Will propose a patch to add a flush mechanism and also flush the log when the 
 task is done.



[jira] [Created] (YARN-1189) NMTokenSecretManagerInNM is not being told when applications have finished

2013-09-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-1189:


 Summary: NMTokenSecretManagerInNM is not being told when 
applications have finished 
 Key: YARN-1189
 URL: https://issues.apache.org/jira/browse/YARN-1189
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe


The {{appFinished}} method is not being called when applications have finished. 
 This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} 
are never being pruned.




[jira] [Commented] (YARN-1185) FileSystemRMStateStore doesn't use temporary files when writing data

2013-09-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765607#comment-13765607
 ] 

Bikas Saha commented on YARN-1185:
--

The expectation was that the data being written to each file is so small that 
it gets sent over completely in the first transfer. So partial writes would not 
occur. That's why the code accumulates the write buffer locally and then issues 
a single write to the FileSystem.
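
For illustration only, a small sketch of that accumulate-locally-then-single-write idea (not the actual store code):
{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hedged sketch: build the full record in memory, then hand it to the output
// stream in a single write call so no partial record is ever sent.
public class SingleWriteSketch {
  public static void writeOnce(OutputStream out, byte[]... chunks) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    for (byte[] chunk : chunks) {
      buffer.write(chunk); // accumulate locally first
    }
    try {
      out.write(buffer.toByteArray()); // one write to the underlying FileSystem stream
    } finally {
      out.close();
    }
  }
}
{code}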

 FileSystemRMStateStore doesn't use temporary files when writing data
 

 Key: YARN-1185
 URL: https://issues.apache.org/jira/browse/YARN-1185
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe

 FileSystemRMStateStore writes directly to the destination file when storing 
 state. However, if the RM were to crash in the middle of the write, the 
 recovery method could encounter a partially-written file and either outright 
 crash during recovery or silently load incomplete state.
 To avoid this, the data should be written to a temporary file and renamed to 
 the destination file afterwards.



[jira] [Commented] (YARN-1183) MiniYARNCluster shutdown takes several minutes intermittently

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765891#comment-13765891
 ] 

Hadoop QA commented on YARN-1183:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602875/YARN-1183--n2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1905//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1905//console

This message is automatically generated.

 MiniYARNCluster shutdown takes several minutes intermittently
 -

 Key: YARN-1183
 URL: https://issues.apache.org/jira/browse/YARN-1183
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Andrey Klochkov
 Attachments: YARN-1183--n2.patch, YARN-1183.patch


 As described in MAPREDUCE-5501, M/R tests sometimes leave MRAppMaster java 
 processes living for several minutes after successful completion of the 
 corresponding test. There is a concurrency issue in the MiniYARNCluster shutdown 
 logic which leads to this. Sometimes the RM stops before an app master sends its 
 last report, and then the app master keeps retrying for 6 minutes. In some 
 cases this leads to failures in subsequent tests, and it affects the performance of 
 tests as app masters eat resources.



[jira] [Updated] (YARN-1166) YARN 'appsFailed' metric should be of type 'counter'

2013-09-12 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-1166:


Attachment: YARN-1166.3.patch

 YARN 'appsFailed' metric should be of type 'counter'
 

 Key: YARN-1166
 URL: https://issues.apache.org/jira/browse/YARN-1166
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Srimanth Gunturi
Assignee: Akira AJISAKA
Priority: Blocker
 Attachments: YARN-1166.2.patch, YARN-1166.3.patch, YARN-1166.patch


 Currently in YARN's queue metrics, the cumulative metric 'appsFailed' is of 
 type 'gauge', which means the exact value will be reported. 
 All other cumulative queue metrics (AppsSubmitted, AppsCompleted, AppsKilled) 
 are of type 'counter', meaning Ganglia will use the slope to provide deltas 
 between time points.
 To be consistent, the AppsFailed metric should also be of type 'counter'.
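
 For context, a small hypothetical example of the counter-versus-gauge distinction using the hadoop metrics2 library directly (these are not the actual QueueMetrics fields):
 {code}
 import org.apache.hadoop.metrics2.lib.MetricsRegistry;
 import org.apache.hadoop.metrics2.lib.MutableCounterInt;
 import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

 // Illustrative only: a counter is monotonically increasing, so consumers such as
 // Ganglia can derive deltas; a gauge is exported as the exact current value.
 public class CounterVsGaugeExample {
   private final MetricsRegistry registry = new MetricsRegistry("example");

   // Counter: the style used by AppsSubmitted, AppsCompleted, AppsKilled.
   private final MutableCounterInt appsCompleted =
       registry.newCounter("AppsCompleted", "Apps completed", 0);

   // Gauge: the style this report wants to move AppsFailed away from.
   private final MutableGaugeInt appsFailed =
       registry.newGauge("AppsFailed", "Apps failed", 0);

   void onAppCompleted() { appsCompleted.incr(); }
   void onAppFailed()    { appsFailed.incr(); }
 }
 {code}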



[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765758#comment-13765758
 ] 

Hadoop QA commented on YARN-1021:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602846/YARN-1021.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-assemblies hadoop-tools/hadoop-sls hadoop-tools/hadoop-tools-dist.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1903//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1903//console

This message is automatically generated.

 Yarn Scheduler Load Simulator
 -

 Key: YARN-1021
 URL: https://issues.apache.org/jira/browse/YARN-1021
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler
Reporter: Wei Yan
Assignee: Wei Yan
 Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
 YARN-1021.pdf


 The Yarn Scheduler is a fertile area of interest with different 
 implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile, 
 several optimizations are also made to improve scheduler performance for 
 different scenarios and workloads. Each scheduler algorithm has its own set of 
 features and drives scheduling decisions by many factors, such as fairness, 
 capacity guarantees, and resource availability. It is very important to 
 evaluate a scheduler algorithm thoroughly before deploying it in a production 
 cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling 
 algorithm: evaluating in a real cluster is time-consuming and costly, 
 and it is also very hard to find a large-enough cluster. Hence, a simulator 
 that can predict how well a scheduler algorithm handles some specific workload 
 would be quite useful.
 We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
 clusters and application loads on a single machine. This would be invaluable 
 in furthering Yarn by providing a tool for researchers and developers to 
 prototype new scheduler features and predict their behavior and performance 
 with a reasonable amount of confidence, thereby aiding rapid innovation.
 The simulator will exercise the real Yarn ResourceManager, removing the 
 network factor by simulating NodeManagers and ApplicationMasters via handling 
 and dispatching NM/AM heartbeat events from within the same JVM.
 To keep track of scheduler behavior and performance, a scheduler wrapper 
 will wrap the real scheduler.
 The simulator will produce real-time metrics while executing, including:
 * Resource usage for the whole cluster and each queue, which can be used to 
 configure cluster and queue capacity.
 * The detailed application execution trace (recorded in relation to simulated 
 time), which can be analyzed to understand/validate the scheduler behavior 
 (individual job turnaround time, throughput, fairness, capacity guarantees, 
 etc.).
 * Several key metrics of the scheduler algorithm, such as the time cost of each 
 scheduler operation (allocate, handle, etc.), which can be used by Hadoop 
 developers to find hot spots in the code and scalability limits.
 The simulator will provide real-time charts showing the behavior of the 
 scheduler and its performance.
 A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing 
 how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765730#comment-13765730
 ] 

Hadoop QA commented on YARN-867:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602838/YARN-867.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1902//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1902//console

This message is automatically generated.

 Isolation of failures in aux services 
 --

 Key: YARN-867
 URL: https://issues.apache.org/jira/browse/YARN-867
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Xuan Gong
Priority: Critical
 Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
 YARN-867.4.patch, YARN-867.sampleCode.2.patch


 Today, a malicious application can bring down the NM by sending bad data to a 
 service. For example, sending data to the ShuffleService such that it results 
 in any non-IOException will cause the NM's async dispatcher to exit, as the 
 service's INIT APP event is not handled properly.
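
 As a rough sketch of the isolation being asked for (purely illustrative; the interface below is made up and is not the NodeManager's AuxServices API):
 {code}
 // Hypothetical sketch: catch unexpected exceptions from an individual auxiliary
 // service so one misbehaving service cannot crash the dispatcher that drives it.
 public class AuxServiceIsolationSketch {

   interface AuxService {
     String getName();
     void initializeApplication(byte[] serviceData); // may throw anything
   }

   static void dispatchInitApp(Iterable<AuxService> services, byte[] data) {
     for (AuxService service : services) {
       try {
         service.initializeApplication(data);
       } catch (Throwable t) {
         // Log and continue instead of letting the exception escape into the
         // async dispatcher, which would otherwise shut the NM down.
         System.err.println("Aux service " + service.getName() + " failed: " + t);
       }
     }
   }
 }
 {code}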



[jira] [Created] (YARN-1190) enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb

2013-09-12 Thread Siqi Li (JIRA)
Siqi Li created YARN-1190:
-

 Summary: enabling uber mode with 0 reducers still requires 
mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb
 Key: YARN-1190
 URL: https://issues.apache.org/jira/browse/YARN-1190
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li
Priority: Minor






[jira] [Updated] (YARN-1190) enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb

2013-09-12 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li updated YARN-1190:
--

Description: Since there is no reducer, the memory allocated to the reducer is 
irrelevant to enabling uber mode for a job

 enabling uber mode with 0 reducers still requires mapreduce.reduce.memory.mb to 
 be less than yarn.app.mapreduce.am.resource.mb
 

 Key: YARN-1190
 URL: https://issues.apache.org/jira/browse/YARN-1190
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siqi Li
Priority: Minor

 Since there is no reducer, the memory allocated to the reducer is irrelevant to 
 enabling uber mode for a job



[jira] [Created] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler

2013-09-12 Thread Akira AJISAKA (JIRA)
Akira AJISAKA created YARN-1188:
---

 Summary: The context of QueueMetrics becomes 'default' when using 
FairScheduler
 Key: YARN-1188
 URL: https://issues.apache.org/jira/browse/YARN-1188
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Akira AJISAKA
Priority: Minor
 Fix For: 2.3.0, 3.0.0


I found the context of QueueMetrics changed from 'yarn' to 'default' when I was 
using FairScheduler.
The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics 
like the one below:

{code}
+ @Metrics(context="yarn")
public class FSQueueMetrics extends QueueMetrics {
{code}



[jira] [Updated] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch

2013-09-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-1191:


Attachment: YARN-1191-1.patch

Attaching patch.

Thanks,
Mayank

 [YARN-321] Compilation is failing for YARN-321 branch
 -

 Key: YARN-1191
 URL: https://issues.apache.org/jira/browse/YARN-1191
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-1191-1.patch


 Compilation is failing for YARN-321 branch



[jira] [Created] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch

2013-09-12 Thread Mayank Bansal (JIRA)
Mayank Bansal created YARN-1191:
---

 Summary: [YARN-321] Compilation is failing for YARN-321 branch
 Key: YARN-1191
 URL: https://issues.apache.org/jira/browse/YARN-1191
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Zhijie Shen


Like YARN-978, we need some client-oriented class to expose the container 
history info. Neither Container nor RMContainer is the right one.



[jira] [Updated] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch

2013-09-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal updated YARN-1191:


Description: 
Compilation is failing for YARN-321 branch


  was:Like YARN-978, we need some client-oriented class to expose the container 
history info. Neither Container nor RMContainer is the right one.


 [YARN-321] Compilation is failing for YARN-321 branch
 -

 Key: YARN-1191
 URL: https://issues.apache.org/jira/browse/YARN-1191
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Zhijie Shen

 Compilation is failing for YARN-321 branch



[jira] [Commented] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766005#comment-13766005
 ] 

Hadoop QA commented on YARN-1191:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602903/YARN-1191-1.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1906//console

This message is automatically generated.

 [YARN-321] Compilation is failing for YARN-321 branch
 -

 Key: YARN-1191
 URL: https://issues.apache.org/jira/browse/YARN-1191
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: YARN-1191-1.patch


 Compilation is failing for YARN-321 branch



[jira] [Assigned] (YARN-1191) [YARN-321] Compilation is failing for YARN-321 branch

2013-09-12 Thread Mayank Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Bansal reassigned YARN-1191:
---

Assignee: Mayank Bansal  (was: Zhijie Shen)

 [YARN-321] Compilation is failing for YARN-321 branch
 -

 Key: YARN-1191
 URL: https://issues.apache.org/jira/browse/YARN-1191
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Mayank Bansal

 Compilation is failing for YARN-321 branch



[jira] [Assigned] (YARN-1157) ResourceManager UI has invalid tracking URL link for distributed shell application

2013-09-12 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong reassigned YARN-1157:
---

Assignee: Xuan Gong

 ResourceManager UI has invalid tracking URL link for distributed shell 
 application
 --

 Key: YARN-1157
 URL: https://issues.apache.org/jira/browse/YARN-1157
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Tassapol Athiapinya
Assignee: Xuan Gong
 Fix For: 2.1.1-beta


 Submit a YARN distributed shell application and go to the ResourceManager web UI. The 
 application appears. In the Tracking UI column, there will be a history 
 link. Clicking on that link shows an HTTP 500 
 error instead of the application master web UI.



[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation

2013-09-12 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766028#comment-13766028
 ] 

Mayank Bansal commented on YARN-978:


If we can add the tracking URL, that would be useful in the UI.

Thanks,
Mayank

 [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
 --

 Key: YARN-978
 URL: https://issues.apache.org/jira/browse/YARN-978
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Mayank Bansal
Assignee: Xuan Gong
 Fix For: YARN-321

 Attachments: YARN-978-1.patch, YARN-978.2.patch, YARN-978.3.patch, 
 YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch, YARN-978.7.patch


 We dont have ApplicationAttemptReport and Protobuf implementation.
 Adding that.
 Thanks,
 Mayank



[jira] [Updated] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-540:
-

Attachment: YARN-540.7.patch

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
 YARN-540.patch, YARN-540.patch


 When a job succeeds and successfully calls finishApplicationMaster, and the RM shuts down 
 and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The 
 next time the RM comes back, it will reload the existing state files even though 
 the job has succeeded



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766032#comment-13766032
 ] 

Jian He commented on YARN-540:
--

Thanks for the detailed comments; uploaded a new patch.

bq. there is a member variable in AMRMClient that is used to get the ping 
interval from config
It turns out we have one in AMRMClientAsync but not in AMRMClientImpl.
bq. Is delete not throwing an exception for non-existent location?
Delete throws an exception in the case of a non-existent location.
bq. Can RMAppEventType.ATTEMPT_FAILED be received when in REMOVING state (and 
also when in FINISHING state)?
Once we have moved to the REMOVING/FINISHING state, it indicates the attempt has gone to 
the FINISHING state, so it should not be possible to generate an 
RMAppEventType.ATTEMPT_FAILED event in that state.
bq. What is the YARNApplicationState enum corresponding to AppState.REMOVING?
In the case of REMOVING, return the YARNApplicationState as RUNNING; does that make sense?

Addressed the other comments as well.

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
 YARN-540.patch, YARN-540.patch


 When a job succeeds and successfully calls finishApplicationMaster, and the RM shuts down 
 and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The 
 next time the RM comes back, it will reload the existing state files even though 
 the job has succeeded



[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766040#comment-13766040
 ] 

Hadoop QA commented on YARN-540:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602910/YARN-540.7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:red}-1 javac{color:red}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1907//console

This message is automatically generated.

 Race condition causing RM to potentially relaunch already unregistered AMs on 
 RM restart
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
 YARN-540.4.patch, YARN-540.5.patch, YARN-540.6.patch, YARN-540.7.patch, 
 YARN-540.patch, YARN-540.patch


 When a job succeeds and successfully calls finishApplicationMaster, and the RM shuts down 
 and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The 
 next time the RM comes back, it will reload the existing state files even though 
 the job has succeeded



[jira] [Updated] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler

2013-09-12 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-1188:


Target Version/s: 2.3.0

 The context of QueueMetrics becomes 'default' when using FairScheduler
 --

 Key: YARN-1188
 URL: https://issues.apache.org/jira/browse/YARN-1188
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Akira AJISAKA
Priority: Minor
  Labels: metrics, newbie

 I found the context of QueueMetrics changed from 'yarn' to 'default' when I 
 was using FairScheduler.
 The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics 
 like the one below:
 {code}
 + @Metrics(context="yarn")
 public class FSQueueMetrics extends QueueMetrics {
 {code}



[jira] [Updated] (YARN-1188) The context of QueueMetrics becomes 'default' when using FairScheduler

2013-09-12 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-1188:


Fix Version/s: (was: 2.3.0)
   (was: 3.0.0)

 The context of QueueMetrics becomes 'default' when using FairScheduler
 --

 Key: YARN-1188
 URL: https://issues.apache.org/jira/browse/YARN-1188
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Akira AJISAKA
Priority: Minor
  Labels: metrics, newbie

 I found the context of QueueMetrics changed from 'yarn' to 'default' when I 
 was using FairScheduler.
 The context should always be 'yarn'; this can be fixed by adding an annotation to FSQueueMetrics 
 like the one below:
 {code}
 + @Metrics(context="yarn")
 public class FSQueueMetrics extends QueueMetrics {
 {code}
