+1. Thanks for delivering this important piece of work! I look forward to seeing it in the trunk.
Regards,
Sangjin

On Thu, Jul 27, 2017 at 2:49 PM, Botong Huang <[email protected]> wrote:

> +1 (non-binding)
>
> We have just deployed the latest bits (62f1ce2a3d9) from YARN-2915 in our
> test cluster and ran multiple jobs. We confirm that Federation is working
> e2e!
>
> Our cluster setup: eight sub-clusters, each with one RM and four NM nodes.
> One Router machine. SQL Server in Ubuntu is used as FederationStateStore.
>
> Cheers,
>
> Botong
>
> On Thu, Jul 27, 2017 at 2:30 PM, Carlo Aldo Curino <[email protected]>
> wrote:
>
> > +1
> >
> > Cheers,
> > Carlo
> >
> > On Thu, Jul 27, 2017 at 12:45 PM, Arun Suresh <[email protected]> wrote:
> >
> > > +1
> > >
> > > Cheers
> > > -Arun
> > >
> > > On Jul 25, 2017 8:24 PM, "Subru Krishnan" <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Per earlier discussion [9], I'd like to start a formal vote to merge
> > >> feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for
> > >> 7 days, and will end Aug 1 7PM PDT.
> > >>
> > >> We have been developing the feature in a branch (YARN-2915 [2]) for a
> > >> while, and we are reasonably confident that the state of the feature
> > >> meets the criteria to be merged onto trunk.
> > >>
> > >> *Key Ideas*:
> > >>
> > >> YARN’s centralized design allows strict enforcement of scheduling
> > >> invariants and effective resource sharing, but becomes a scalability
> > >> bottleneck (in number of jobs and nodes) well before reaching the scale
> > >> of our clusters (e.g., 20k-50k nodes).
> > >>
> > >> To address these limitations, we developed a scale-out, federation-based
> > >> solution (YARN-2915). Our architecture scales near-linearly to
> > >> datacenter-sized clusters, by partitioning nodes across multiple
> > >> sub-clusters (each running a YARN cluster of a few thousand nodes).
> > >> Applications can span multiple sub-clusters *transparently (i.e.
> > >> no code change or recompilation of existing apps)*, thanks to a layer of
> > >> indirection that negotiates with multiple sub-clusters' Resource
> > >> Managers on behalf of the application.
> > >>
> > >> This design is structurally scalable, as it bounds the number of nodes
> > >> each RM is responsible for. Appropriate policies ensure that the
> > >> majority of applications reside within a single sub-cluster, thus
> > >> further controlling the load on each RM. This provides near-linear
> > >> scale-out by simply adding more sub-clusters. The same mechanism enables
> > >> pooling of resources from clusters owned and operated by different
> > >> teams.
> > >>
> > >> Status:
> > >>
> > >> - The version we would like to merge to trunk is termed "MVP" (minimal
> > >>   viable product). The feature will have a complete end-to-end
> > >>   application execution flow with the ability to span a single
> > >>   application across multiple YARN (sub) clusters.
> > >> - There were 50+ sub-tasks that were completed as part of this effort.
> > >>   Every patch has been reviewed and +1ed by a committer. Thanks to
> > >>   Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
> > >> - Federation is designed to be built around YARN and consequently has
> > >>   minimal code changes to core YARN. The relevant JIRAs that modify
> > >>   the existing YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also
> > >>   paid close attention to ensure that if federation is disabled there
> > >>   is zero impact to existing functionality (disabled by default).
> > >> - We found a few bugs as we went along which we fixed directly
> > >>   upstream in trunk and/or branch-2.
> > >> - We have been continuously rebasing the feature branch [2] so the
> > >>   merge should be a straightforward cherry-pick.
> > >> - The current version has been rather thoroughly tested and is
> > >>   currently deployed in a *10,000+ node federated YARN cluster that's
> > >>   running upwards of 50k jobs daily with a reliability of 99.9%*.
> > >> - We have a few ideas for follow-up extensions/improvements which are
> > >>   tracked in the umbrella JIRA YARN-5597 [3].
> > >>
> > >> Documentation:
> > >>
> > >> - Quick start guide (maven site) - YARN-6484 [4].
> > >> - Overall design doc [5] and the slide-deck [6] we used for our talk at
> > >>   Hadoop Summit 2016 are available in the umbrella JIRA - YARN-2915.
> > >>
> > >> Credits:
> > >>
> > >> This is a group effort that would not have been possible without the
> > >> ideas and hard work of many other folks and we would like to
> > >> specifically call out Giovanni, Botong & Ellen for their invaluable
> > >> contributions. Also big thanks to the many folks in the community
> > >> (Sriram, Kishore, Sarvesh, Jian, Wangda, Karthik, Vinod, Varun, Inigo,
> > >> Vrushali, Sangjin, Joep, Rohith and many more) that helped us shape our
> > >> ideas and code with very insightful feedback and comments.
> > >>
> > >> Cheers,
> > >> Subru & Carlo
> > >>
> > >> [1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
> > >> [2] https://github.com/apache/hadoop/tree/YARN-2915
> > >> [3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
> > >> [4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
> > >> [5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
> > >> [6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
> > >> [7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
> > >> [8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
> > >> [9] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E
