Hi all,

We would like to open a discussion on merging the YARN Federation
(YARN-2915) [1] feature to trunk.  We have been developing the feature in a
feature branch (YARN-2915 [2]) for a while, and we are reasonably confident
that the state of the feature meets the criteria to be merged onto trunk.

*Key Ideas*:

YARN’s centralized design allows strict enforcement of scheduling
invariants and effective resource sharing, but becomes a scalability
bottleneck (in number of jobs and nodes) well before reaching the scale of
our clusters (e.g., 20k-50k nodes).


To address these limitations, we developed a scale-out, federation-based
solution (YARN-2915). Our architecture scales near-linearly to
datacenter-sized clusters by partitioning nodes across multiple sub-clusters
(each running a YARN cluster of a few thousand nodes). Applications can span
multiple sub-clusters *transparently (i.e. no code change or recompilation
of existing apps)*, thanks to a layer of indirection that negotiates with
multiple sub-clusters' Resource Managers on behalf of the application.
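
As a rough illustration of that indirection layer (a toy sketch, not the
code in the branch), the snippet below splits one container request across
several sub-clusters on the application's behalf. The function name
split_request, the (sub-cluster, free containers) input format, and the
"fill the home sub-cluster first" order are all assumptions made for the
example.

```python
def split_request(num_containers, headroom):
    """Split one container ask across sub-clusters.

    headroom is a hypothetical ordered list of (subcluster_id, free_containers)
    pairs, with the application's home sub-cluster first. Returns the
    per-sub-cluster plan and the number of containers left unsatisfied.
    """
    remaining = num_containers
    plan = {}
    for subcluster_id, free in headroom:
        if remaining == 0:
            break
        # Take as much as this sub-cluster can give, never more than needed.
        take = min(remaining, free)
        if take > 0:
            plan[subcluster_id] = take
            remaining -= take
    return plan, remaining


if __name__ == "__main__":
    plan, unmet = split_request(10, [("sc1", 6), ("sc2", 3), ("sc3", 5)])
    print(plan, unmet)
```

The application only ever sees the single request it made; the fan-out to
the individual Resource Managers stays hidden behind the layer.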


This design is structurally scalable, as it bounds the number of nodes each
RM is responsible for. Appropriate policies ensure that the majority of
applications reside within a single sub-cluster, thus further controlling
the load on each RM. This provides near-linear scale-out by simply adding
more sub-clusters. The same mechanism enables pooling of resources from
clusters owned and operated by different teams.
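
To make the policy idea concrete, here is a minimal sketch of one possible
routing policy: each application is mapped deterministically to a single
home sub-cluster, in proportion to per-sub-cluster weights. The function
pick_home_subcluster, the weights dictionary, and the hash-based choice are
assumptions for illustration only and do not describe the policies shipped
in the branch.

```python
import hashlib


def pick_home_subcluster(app_id, weights):
    """Map an application id to one home sub-cluster.

    weights is a hypothetical dict of subcluster_id -> non-negative weight;
    a weight of 0 means the sub-cluster takes no new applications.
    """
    active = sorted((sc, w) for sc, w in weights.items() if w > 0)
    if not active:
        raise ValueError("no active sub-cluster")
    total = sum(w for _, w in active)
    # Hash the app id to a stable point in [0, total).
    digest = hashlib.sha256(app_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for sc, w in active:
        cumulative += w
        if point < cumulative:
            return sc
    return active[-1][0]  # guard against floating-point rounding


if __name__ == "__main__":
    weights = {"sc1": 2.0, "sc2": 1.0, "sc3": 0.0}
    print(pick_home_subcluster("application_0001", weights))
```

Because every application lands in exactly one home sub-cluster, the load
on each RM stays bounded, and adding a sub-cluster only means adding an
entry to the weights.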

Status:

   - The version we would like to merge to trunk is termed "MVP" (minimal
   viable product). The feature will have a complete end-to-end application
   execution flow with the ability to span a single application across
   multiple YARN (sub) clusters.
   - There were 50+ sub-tasks that were completed as part of this
   effort. Every patch has been reviewed and +1ed by a committer. Thanks to
   Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
   - Federation is designed to be built around YARN and consequently has
   minimal code changes to core YARN. The relevant JIRAs that modify existing
   YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also paid close
   attention to ensure that if federation is disabled there is zero impact to
   existing functionality (disabled by default).
   - We found a few bugs as we went along which we fixed directly upstream
   in trunk and/or branch-2.
   - We have been continuously rebasing the feature branch [2] so the merge
   should be a straightforward cherry-pick.
   - The current version has been rather thoroughly tested and is currently
   deployed in a *10,000+ node federated YARN cluster that's running
   upwards of 50k jobs daily with a reliability of 99.9%*.
   - We have a few ideas for follow-up extensions/improvements which are
   tracked in the umbrella JIRA YARN-5597[3].


Documentation:

   - Quick start guide (maven site) - YARN-6484[4].
   - Overall design doc [5] and the slide-deck [6] we used for our talk at
   Hadoop Summit 2016 are available in the umbrella JIRA - YARN-2915.


Credits:

This is a group effort that would not have been possible without the ideas
and hard work of many other folks and we would like to specifically call
out Giovanni, Botong & Ellen for their invaluable contributions. Also big
thanks to the many folks in the community (Sriram, Kishore, Sarvesh, Jian,
Wangda, Karthik, Vinod, Varun, Inigo, Vrushali, Sangjin, Joep, Rohith and
many more) that helped us shape our ideas and code with very insightful
feedback and comments.

We plan to start the merge vote in the next week or so. The branch is close
to complete (~5 patches before one can kick the tires on a running
deployment). Please look through the branch; feedback is welcome. Thanks!

Cheers,
Subru & Carlo

[1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
[2] https://github.com/apache/hadoop/tree/YARN-2915
[3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
[4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
[5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
[6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
[7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
[8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
