Hi Andrew,

You are raising pertinent questions. One of the key design points of Federation was to be completely transparent to applications, i.e. there should be no code change or even recompile required to run existing apps in a federated cluster. In summary, apps simply get the appearance of a larger cluster to play around with. Consequently, there are zero public API changes for YARN Federation (we have new APIs for FederationStateStore, but those are purely private). Additionally, we have backported the code to our internal branch (currently based on 2.7.1) and have been running it in production at a scale of tens of thousands of nodes.
I agree with you regarding the backport to branch-2. We are planning to get that done by August and hence included it in the proposed release plan[1] for 2.9.0.

Cheers,
Subru

[1] https://www.mail-archive.com/[email protected]/msg27126.html

On Mon, Jul 31, 2017 at 1:29 PM, Andrew Wang <[email protected]> wrote:

> Hi all,
>
> Sorry for coming to this late, I wasn't on yarn-dev and someone else
> mentioned that this feature was being merged.
>
> With my RM hat on, trunk is an active release branch, so we want to be
> merging features when they are production-ready. This feature has done one
> better, and has already been run at 10k-node scale! It's great to see this
> level of testing and validation for a branch merge.
>
> Could one of the contributors comment on compatibility and API stability?
> It looks like it's compatible and stable, but I wanted to confirm since the
> target 3.0.0-beta1 release date of mid-September means there isn't much
> time to do additional development in trunk.
>
> Finally, could someone comment on the timeline for merging this into
> branch-2? Given that the feature seems ready, I expect we'd quickly
> backport this to branch-2 as well.
>
> Best,
> Andrew
>
> On Mon, Jul 31, 2017 at 1:05 PM, Naganarasimha Garla <[email protected]> wrote:
>
> > +1, Quite interesting and useful feature. Hoping to see it in 2.9 too.
> >
> > On Tue, Aug 1, 2017 at 1:31 AM, Jason Lowe <[email protected]> wrote:
> >
> > > +1
> > > Jason
> > >
> > > On Tuesday, July 25, 2017 10:24 PM, Subru Krishnan <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Per earlier discussion [9], I'd like to start a formal vote to merge
> > > feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for 7
> > > days, and will end Aug 1 7PM PDT.
> > >
> > > We have been developing the feature in a branch (YARN-2915 [2]) for a
> > > while, and we are reasonably confident that the state of the feature meets
> > > the criteria to be merged onto trunk.
> > >
> > > *Key Ideas*:
> > >
> > > YARN’s centralized design allows strict enforcement of scheduling
> > > invariants and effective resource sharing, but becomes a scalability
> > > bottleneck (in number of jobs and nodes) well before reaching the scale of
> > > our clusters (e.g., 20k-50k nodes).
> > >
> > > To address these limitations, we developed a scale-out, federation-based
> > > solution (YARN-2915). Our architecture scales near-linearly to
> > > datacenter-sized clusters, by partitioning nodes across multiple
> > > sub-clusters (each running a YARN cluster of a few thousand nodes).
> > > Applications can span multiple sub-clusters *transparently (i.e. no code
> > > change or recompilation of existing apps)*, thanks to a layer of
> > > indirection that negotiates with multiple sub-clusters' Resource Managers
> > > on behalf of the application.
> > >
> > > This design is structurally scalable, as it bounds the number of nodes each
> > > RM is responsible for. Appropriate policies ensure that the majority of
> > > applications reside within a single sub-cluster, thus further controlling
> > > the load on each RM. This provides near-linear scale-out by simply adding
> > > more sub-clusters. The same mechanism enables pooling of resources from
> > > clusters owned and operated by different teams.
> > >
> > > Status:
> > >
> > >   - The version we would like to merge to trunk is termed "MVP" (minimal
> > >     viable product). The feature will have a complete end-to-end
> > >     application execution flow with the ability to span a single
> > >     application across multiple YARN (sub) clusters.
> > >   - There were 50+ sub-tasks that were completed as part of this effort.
> > >     Every patch has been reviewed and +1ed by a committer. Thanks to
> > >     Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
> > >   - Federation is designed to be built around YARN and consequently has
> > >     minimal code changes to core YARN. The relevant JIRAs that modify
> > >     existing YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also paid
> > >     close attention to ensure that if federation is disabled there is zero
> > >     impact to existing functionality (disabled by default).
> > >   - We found a few bugs as we went along which we fixed directly upstream
> > >     in trunk and/or branch-2.
> > >   - We have been continuously rebasing the feature branch [2] so the merge
> > >     should be a straightforward cherry-pick.
> > >   - The current version has been rather thoroughly tested and is currently
> > >     deployed in a *10,000+ node federated YARN cluster that's running
> > >     upwards of 50k jobs daily with a reliability of 99.9%*.
> > >   - We have a few ideas for follow-up extensions/improvements which are
> > >     tracked in the umbrella JIRA YARN-5597[3].
> > >
> > > Documentation:
> > >
> > >   - Quick start guide (maven site) - YARN-6484[4].
> > >   - Overall design doc[5] and the slide-deck [6] we used for our talk at
> > >     Hadoop Summit 2016 are available in the umbrella jira - YARN-2915.
> > >
> > > Credits:
> > >
> > > This is a group effort that could not have been possible without the ideas
> > > and hard work of many other folks and we would like to specifically call
> > > out Giovanni, Botong & Ellen for their invaluable contributions. Also big
> > > thanks to the many folks in community (Sriram, Kishore, Sarvesh, Jian,
> > > Wangda, Karthik, Vinod, Varun, Inigo, Vrushali, Sangjin, Joep, Rohith and
> > > many more) that helped us shape our ideas and code with very insightful
> > > feedback and comments.
> > >
> > > Cheers,
> > > Subru & Carlo
> > >
> > > [1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
> > > [2] https://github.com/apache/hadoop/tree/YARN-2915
> > > [3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
> > > [4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
> > > [5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
> > > [6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
> > > [7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
> > > [8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
> > > [9] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E
