[
https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13744493#comment-13744493
]
Carlo Curino commented on YARN-1021:
------------------------------------
Sorry for the delay. I went over the patch today together with Chris Douglas,
and here is some input from both of us.
I generally like the effort, and the live visualization is really neat. Also
making it into a completely separate tool is convenient/safe.
The main limitations I see in this simulator are:
* it only simulates the Scheduler code, mocking out most of the RM and all of
the AM, NM, communication, submissions, etc.
* if I am not mistaken, it runs at wall-clock time (not faster)
* it does not run the "monitors", which are needed for simulating preemption in
the CapacityScheduler
An alternative approach that we explored was to hijack the "Clocks" around the
RM and drive them using a discrete event simulation, thus exercising more of
the RM code, protocols, etc., and enabling faster than wall-clock speeds
(though this is not trivial to achieve). We have some working but unpolished
code in this space, which we could probably provide if you think it might be
integrated/leveraged.
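
To make the clock-hijacking idea concrete, here is a minimal sketch of a
discrete event driver that owns virtual time. It is not our actual code and it
does not touch the RM: SimClock, DiscreteEventDriver and DiscreteEventDemo are
hypothetical names made up for illustration, and the periodic "heartbeat" is
just a stand-in for whatever component would read the clock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical clock abstraction: components ask it for the time instead of the system clock. */
interface SimClock {
  long getTime();
}

/** Event loop that owns virtual time and jumps straight to the next scheduled event. */
final class DiscreteEventDriver implements SimClock {
  private static final class Event implements Comparable<Event> {
    final long time;
    final long seq; // tie-breaker so events at the same time run in insertion order
    final Runnable action;

    Event(long time, long seq, Runnable action) {
      this.time = time;
      this.seq = seq;
      this.action = action;
    }

    @Override
    public int compareTo(Event other) {
      int byTime = Long.compare(time, other.time);
      return byTime != 0 ? byTime : Long.compare(seq, other.seq);
    }
  }

  private final PriorityQueue<Event> queue = new PriorityQueue<>();
  private long now = 0;
  private long nextSeq = 0;

  @Override
  public long getTime() {
    return now;
  }

  /** Schedules an action delayMs virtual milliseconds from the current virtual time. */
  void schedule(long delayMs, Runnable action) {
    if (delayMs < 0) {
      throw new IllegalArgumentException("negative delay");
    }
    queue.add(new Event(now + delayMs, nextSeq++, action));
  }

  /** Runs every event scheduled at or before untilMs, without ever sleeping. */
  void runUntil(long untilMs) {
    while (!queue.isEmpty() && queue.peek().time <= untilMs) {
      Event e = queue.poll();
      now = e.time; // virtual time jumps to the event, so idle gaps cost nothing
      e.action.run();
    }
    now = Math.max(now, untilMs);
  }
}

final class DiscreteEventDemo {
  /** Simulates a periodic heartbeat and returns the virtual times at which it fired. */
  static List<Long> heartbeatTimes(long intervalMs, long horizonMs) {
    DiscreteEventDriver driver = new DiscreteEventDriver();
    List<Long> fired = new ArrayList<>();
    Runnable[] beat = new Runnable[1];
    beat[0] = () -> {
      fired.add(driver.getTime());
      driver.schedule(intervalMs, beat[0]); // re-arm the next heartbeat
    };
    driver.schedule(intervalMs, beat[0]);
    driver.runUntil(horizonMs);
    return fired;
  }

  public static void main(String[] args) {
    System.out.println(heartbeatTimes(1000, 5000));
  }
}
```

Because the loop never sleeps, five simulated seconds of heartbeats complete
immediately. The hard part in practice is making every time-dependent
component read the injected clock rather than the system one.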
Setting aside the alternative approaches and the broader concerns mentioned
above, there are a few issues with the current patch:
* It should be possible to replay a run consistently (seed RANDOM)
* Using the Rumen reader (JobProducer, etc.) instead of parsing JSON directly
seems cleaner. Also, we have a synthetic load generator, which we will release
soon, that implements the JobProducer/JobStory interface (it might be nice to
use that to drive your simulations)
* LICENSE/NOTICE should be updated to include the BSD-like licenses you bring
in with the new libraries
* It seems somewhat hard to detect regressions w/ trunk since:
** mocks away much of the AM/NM/RM
** few unit tests
** does not simulate important behaviors in the AM (no slow start, headroom,
etc.)
** does not exercise failures, timeouts, etc.
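
On the replay point in the list above, a minimal sketch of what seeding could
look like. SeededWorkload and taskDurations are hypothetical names, not
anything in the patch; the assumption is simply that all randomness in a run
flows from a single java.util.Random built from a recorded seed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SeededWorkload {
  /** Draws n hypothetical task durations (ms) from a Random built from the given seed. */
  static List<Integer> taskDurations(long seed, int n) {
    Random rng = new Random(seed); // same seed, same sequence of draws
    List<Integer> durations = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      durations.add(1000 + rng.nextInt(9000)); // uniform in [1000, 10000)
    }
    return durations;
  }

  public static void main(String[] args) {
    // Take the seed from the command line so a run can be repeated exactly,
    // and print it so it ends up in the logs next to the results.
    long seed = args.length > 0 ? Long.parseLong(args[0]) : 42L;
    System.out.println("seed=" + seed + " durations=" + taskDurations(seed, 5));
  }
}
```

Logging the seed with the output is what makes a surprising run reproducible
later; creating unseeded Random instances anywhere in the code defeats it.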
Smaller issues:
* some javadoc @param unpopulated
* why a dependency on another metrics package, instead of Hadoop's?
* why NodeUpdateSchedulerEventWrapper? Doesn't seem to add anything...
* use ResourceCalculator instead of manually adjusting Resources from RR
* initMetrics is a very large method...
* SLSWebApp is a wall of string appends. I am not very web-savvy, but I believe
there should be cleaner ways to generate this. It seems hard to
maintain/evolve.
I hope this helps. I will be traveling abroad for a couple of weeks, so I might
be slow/unresponsive. Altogether, since it is rather "on the side", I am not too
concerned about it; the suggestions are mostly to make sure it is really useful
and that people can use and maintain it over time. If committed as is, it will
do no harm, but I think it risks being dropped in, used twice for FairScheduler
work, and then losing relevance and getting out of sync with trunk.
> Yarn Scheduler Load Simulator
> -----------------------------
>
> Key: YARN-1021
> URL: https://issues.apache.org/jira/browse/YARN-1021
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: scheduler
> Reporter: Wei Yan
> Assignee: Wei Yan
> Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz,
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch,
> YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different
> implementations, e.g., Fifo, Capacity and Fair schedulers. Meanwhile,
> several optimizations are also made to improve scheduler performance for
> different scenarios and workloads. Each scheduler algorithm has its own set of
> features, and drives scheduling decisions by many factors, such as fairness,
> capacity guarantee, resource availability, etc. It is very important to
> evaluate a scheduler algorithm very well before we deploy it in a production
> cluster. Unfortunately, it is currently non-trivial to evaluate a scheduling
> algorithm. Evaluating in a real cluster is always time-consuming and costly,
> and it is also very hard to find a large-enough cluster. Hence, a simulator
> which can predict how well a scheduler algorithm performs for some specific
> workload would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn
> clusters and application loads in a single machine. This would be invaluable
> in furthering Yarn by providing a tool for researchers and developers to
> prototype new scheduler features and predict their behavior and performance
> with a reasonable amount of confidence, thereby aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager removing the
> network factor by simulating NodeManagers and ApplicationMasters via handling
> and dispatching NM/AMs heartbeat events from within the same JVM.
> To keep track of scheduler behavior and performance, a scheduler wrapper
> will wrap the real scheduler.
> The simulator will produce real time metrics while executing, including:
> * Resource usages for whole cluster and each queue, which can be utilized to
> configure cluster and queue's capacity.
> * The detailed application execution trace (recorded in relation to simulated
> time), which can be analyzed to understand/validate the scheduler behavior
> (individual jobs turn around time, throughput, fairness, capacity guarantee,
> etc).
> * Several key metrics of scheduler algorithm, such as time cost of each
> scheduler operation (allocate, handle, etc), which can be utilized by Hadoop
> developers to find the code spots and scalability limits.
> The simulator will provide real time charts showing the behavior of the
> scheduler and its performance.
> A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE,
> showing how to use the simulator to simulate the Fair Scheduler and the
> Capacity Scheduler.