[ 
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745286#comment-13745286
 ] 

Wei Yan commented on YARN-1021:
-------------------------------

[~curino], thanks for taking the time to do such a detailed review. 

Addressing your comments:

bq. it only simulates the Scheduler code, mocking out most of the RM, and all 
AM and NM, communication, submissions...

This is not entirely correct. The simulator exercises all RM code except for 
incoming/outgoing network communication. The simulator dispatches/processes the 
events that would be received/sent via the network from/to AMs and NMs. Running 
all AM and NM code in the simulator would be too much overhead.

bq. If I am not mistaken runs at wall-clock time (not faster)

Correct.

Adding a speed factor is a good idea and would not be that difficult, though 
it will require some work in Hadoop (as you suggest later in your comments on 
'hijacking Clocks'). Hadoop already has a Clock interface with a SystemClock 
implementation, but it is hardcoded to use System.currentTimeMillis(). 
SystemClock should be modified to use a configurable implementation (in our 
speedy case, a FastSystemClock). 

The Simulator code needs to be modified to use Hadoop Clock instead of 
System.currentTimeMillis().
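
A rough sketch of what such a sped-up clock could look like. The names 
SimClock and ScaledClock are hypothetical stand-ins (not existing Hadoop 
classes), and the single getTime() method is assumed to mirror the Hadoop 
Clock interface; treat this as an illustration, not the planned patch:

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for the Hadoop Clock interface (assumed: one getTime() method).
interface SimClock {
    long getTime();
}

// Hypothetical clock that advances `ratio` times faster than the wall clock.
final class ScaledClock implements SimClock {
    private final LongSupplier wallClock;   // source of wall-clock millis
    private final long startWallMillis;     // wall-clock time when the clock was created
    private final double ratio;             // 1.0 = real time, 10.0 = ten times faster

    ScaledClock(double ratio, LongSupplier wallClock) {
        if (ratio <= 0) {
            throw new IllegalArgumentException("ratio must be positive");
        }
        this.ratio = ratio;
        this.wallClock = wallClock;
        this.startWallMillis = wallClock.getAsLong();
    }

    // Convenience constructor backed by the real system clock.
    ScaledClock(double ratio) {
        this(ratio, System::currentTimeMillis);
    }

    @Override
    public long getTime() {
        // Scale only the time elapsed since creation, so the clock starts at "now".
        long elapsed = wallClock.getAsLong() - startWallMillis;
        return startWallMillis + (long) (elapsed * ratio);
    }
}
```

Injecting the wall-clock source keeps the sketch testable; code that reads 
time through the interface would not need to know which implementation is 
configured.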

In addition, because the simulator leverages the JDK ScheduledExecutor to 
execute commands based on time, we need to introduce the correction there. For 
this, the Hadoop Clock interface should expose a RealTimeRatio via a new 
method; the default implementation would return 1 (and the speedy one would 
return the corresponding factor). 
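
To illustrate the correction, a minimal sketch: delays expressed in simulated 
time are divided by the ratio before being handed to the JDK executor. 
ScaledScheduler and toWallDelayMillis are hypothetical names used only for 
this example; how the ratio would actually be obtained from Clock is still to 
be designed:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper that converts simulated delays into wall-clock delays.
final class ScaledScheduler {
    private final ScheduledExecutorService executor;
    private final double realTimeRatio;   // 1.0 = real time, 10.0 = ten times faster

    ScaledScheduler(ScheduledExecutorService executor, double realTimeRatio) {
        if (realTimeRatio <= 0) {
            throw new IllegalArgumentException("realTimeRatio must be positive");
        }
        this.executor = executor;
        this.realTimeRatio = realTimeRatio;
    }

    // A simulated delay of 1000 ms at ratio 10 becomes a 100 ms wall-clock delay.
    static long toWallDelayMillis(long simulatedDelayMillis, double realTimeRatio) {
        return Math.max(0L, Math.round(simulatedDelayMillis / realTimeRatio));
    }

    // Schedules the task after the given simulated delay.
    ScheduledFuture<?> schedule(Runnable task, long simulatedDelayMillis) {
        long wallDelay = toWallDelayMillis(simulatedDelayMillis, realTimeRatio);
        return executor.schedule(task, wallDelay, TimeUnit.MILLISECONDS);
    }
}
```

With a ratio of 1 the wrapper behaves exactly like the plain executor, which 
matches the default described above.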

Also, we'll have to scan all RM code to ensure Clock is always used for all 
time-based computations.

I'll open a JIRA for this.

bq. does not run the "monitors" which are needed for simulating preemption in 
the CapacityScheduler

If preemption is decided exclusively on the RM side, this should just work.

If the monitors run in the NMs, then we would need to simulate this.

Or am I missing something? If I am, I'd open a JIRA to handle this.

bq. It should be possible to consistently replay (seed RANDOM)

What exactly do you mean? You can replay a load and things will work similarly, 
but not exactly the same.

bq. Using Rumen reader (JobProducer, etc.) instead of parsing json directly 
seems cleaner. Also we have a synth load generator which we will release soon 
that implements the JobProducer/JobStory interface (might be nice to use that 
to drive your simulations)

Sure, I'll make these changes and upload a new patch.

bq. LICENSE/NOTICE

I'll update those files.

bq. It seems somewhat hard to detect regressions w/ trunk since: mocks away 
much of the AM/NM/RM

It mocks AM/NM, it does not mock RM.

bq. It seems somewhat hard to detect regressions w/ trunk since: few unit tests

I'll try to make a test case that starts a mini MR cluster, runs a job, 
captures the job history, runs Rumen on it and runs a simulation. (I cannot 
promise, but I'll try.)

bq. does not simulate important behaviors in the AM (no slow start, headroom, 
etc.)

Yes, I'm aware of this, I'll open a new JIRA to address these things.

bq. does not exercise failures, timeouts, etc.

Same as previous comment, I'll open a new JIRA to address these things.

bq. some javadoc @param unpopulated

I'll take care of those.

bq. why a dependency on another metrics package, instead of Hadoop's?

The metrics package used by the simulator has a few features out of the box not 
available in Hadoop's metrics:

* writes output files with all metrics information automatically (every N 
seconds) that you can use to analyze the simulation. Hadoop metrics requires 
more work to generate these output files.
* these outputs also contain stddev, percentiles, min/max, etc., which are not 
completely implemented in Hadoop metrics.

bq. use ResourceCalculator instead of manually...

I'll do that.

bq. initMetrics is a very large method ....

I'll break it down.

bq. SLSWebApp: ....

I'll make them template files.


I'll open JIRAs and update the patch.

                
> Yarn Scheduler Load Simulator
> -----------------------------
>
>                 Key: YARN-1021
>                 URL: https://issues.apache.org/jira/browse/YARN-1021
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: scheduler
>            Reporter: Wei Yan
>            Assignee: Wei Yan
>         Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, 
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different 
> implementations, e.g., Fifo, Capacity and Fair  schedulers. Meanwhile, 
> several optimizations are also made to improve scheduler performance for 
> different scenarios and workload. Each scheduler algorithm has its own set of 
> features, and drives scheduling decisions by many factors, such as fairness, 
> capacity guarantee, resource availability, etc. It is very important to 
> evaluate a scheduler algorithm very well before we deploy it in a production 
> cluster. Unfortunately, currently it is non-trivial to evaluate a scheduling 
> algorithm. Evaluating in a real cluster is always time and cost consuming, 
> and it is also very hard to find a large-enough cluster. Hence, a simulator 
> which can predict how well a scheduler algorithm works for some specific 
> workload would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn 
> clusters and application loads in a single machine. This would be invaluable 
> in furthering Yarn by providing a tool for researchers and developers to 
> prototype new scheduler features and predict their behavior and performance 
> with a reasonable amount of confidence, thereby aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager removing the 
> network factor by simulating NodeManagers and ApplicationMasters via handling 
> and dispatching NM/AMs heartbeat events from within the same JVM.
> To keep track of scheduler behavior and performance, a scheduler wrapper 
> will wrap the real scheduler.
> The simulator will produce real time metrics while executing, including:
> * Resource usages for whole cluster and each queue, which can be utilized to 
> configure cluster and queue's capacity.
> * The detailed application execution trace (recorded in relation to simulated 
> time), which can be analyzed to understand/validate the  scheduler behavior 
> (individual jobs turn around time, throughput, fairness, capacity guarantee, 
> etc).
> * Several key metrics of scheduler algorithm, such as time cost of each 
> scheduler operation (allocate, handle, etc), which can be utilized by Hadoop 
> developers to find the code spots and scalability limits.
> The simulator will provide real time charts showing the behavior of the 
> scheduler and its performance.
> A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, 
> showing how to use the simulator to simulate Fair Scheduler and Capacity 
> Scheduler.

