Thanks for the quick response, Subru! Here's my +0, FWIW.

On Mon, Jul 31, 2017 at 3:14 PM, Subru Krishnan <[email protected]> wrote:

> Hi Andrew,
>
> You are raising pertinent questions: one of the key design points of
> Federation was to be completely transparent to applications, i.e. there
> should be no code change or even recompile required to run existing apps
> in a federated cluster. In summary, apps simply get the appearance of a
> larger cluster to play around with. Consequently there are zero public API
> changes (we have new APIs for FederationStateStore but those are purely
> private) for YARN Federation. Additionally we have backported the code to
> our internal branch (currently based on 2.7.1) and have been running in
> production at a scale of 10s of 1000s of nodes.
>
> I agree with you regarding the backport to branch-2. We are planning to
> get that done by August and hence included it in the proposed release
> plan[1] for 2.9.0.
>
> Cheers,
> Subru
>
> [1] https://www.mail-archive.com/[email protected]/msg27126.html
>
> On Mon, Jul 31, 2017 at 1:29 PM, Andrew Wang <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Sorry for coming to this late, I wasn't on yarn-dev and someone else
>> mentioned that this feature was being merged.
>>
>> With my RM hat on, trunk is an active release branch, so we want to be
>> merging features when they are production-ready. This feature has done one
>> better, and has already been run at 10k-node scale! It's great to see this
>> level of testing and validation for a branch merge.
>>
>> Could one of the contributors comment on compatibility and API stability?
>> It looks like it's compatible and stable, but I wanted to confirm since
>> the target 3.0.0-beta1 release date of mid-September means there isn't
>> much time to do additional development in trunk.
>>
>> Finally, could someone comment on the timeline for merging this into
>> branch-2? Given that the feature seems ready, I expect we'd quickly
>> backport this to branch-2 as well.
>>
>> Best,
>> Andrew
>>
>> On Mon, Jul 31, 2017 at 1:05 PM, Naganarasimha Garla <[email protected]> wrote:
>>
>> > +1, Quite interesting and useful feature. Hoping to see it in 2.9 too.
>> >
>> > On Tue, Aug 1, 2017 at 1:31 AM, Jason Lowe <[email protected]> wrote:
>> >
>> > > +1
>> > > Jason
>> > >
>> > >
>> > > On Tuesday, July 25, 2017 10:24 PM, Subru Krishnan <[email protected]> wrote:
>> > >
>> > >
>> > > Hi all,
>> > >
>> > > Per earlier discussion [9], I'd like to start a formal vote to merge
>> > > feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for
>> > > 7 days, and will end Aug 1 7PM PDT.
>> > >
>> > > We have been developing the feature in a branch (YARN-2915 [2]) for a
>> > > while, and we are reasonably confident that the state of the feature
>> > > meets the criteria to be merged onto trunk.
>> > >
>> > > *Key Ideas*:
>> > >
>> > > YARN’s centralized design allows strict enforcement of scheduling
>> > > invariants and effective resource sharing, but becomes a scalability
>> > > bottleneck (in number of jobs and nodes) well before reaching the scale
>> > > of our clusters (e.g., 20k-50k nodes).
>> > >
>> > >
>> > > To address these limitations, we developed a scale-out, federation-based
>> > > solution (YARN-2915). Our architecture scales near-linearly to
>> > > datacenter-sized clusters, by partitioning nodes across multiple
>> > > sub-clusters (each running a YARN cluster of a few thousand nodes).
>> > > Applications can span multiple sub-clusters *transparently (i.e. no code
>> > > change or recompilation of existing apps)*, thanks to a layer of
>> > > indirection that negotiates with multiple sub-clusters' Resource
>> > > Managers on behalf of the application.
>> > >
>> > >
>> > > This design is structurally scalable, as it bounds the number of nodes
>> > > each RM is responsible for.
>> > > Appropriate policies ensure that the majority of applications reside
>> > > within a single sub-cluster, thus further controlling the load on each
>> > > RM. This provides near-linear scale-out by simply adding more
>> > > sub-clusters. The same mechanism enables pooling of resources from
>> > > clusters owned and operated by different teams.
>> > >
>> > > Status:
>> > >
>> > >   - The version we would like to merge to trunk is termed "MVP" (minimal
>> > >   viable product). The feature will have a complete end-to-end
>> > >   application execution flow with the ability to span a single
>> > >   application across multiple YARN (sub) clusters.
>> > >   - There were 50+ sub-tasks that were completed as part of this effort.
>> > >   Every patch has been reviewed and +1ed by a committer. Thanks to
>> > >   Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
>> > >   - Federation is designed to be built around YARN and consequently has
>> > >   minimal code changes to core YARN. The relevant JIRAs that modify the
>> > >   existing YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also
>> > >   paid close attention to ensure that if federation is disabled there
>> > >   is zero impact to existing functionality (disabled by default).
>> > >   - We found a few bugs as we went along which we fixed directly
>> > >   upstream in trunk and/or branch-2.
>> > >   - We have been continuously rebasing the feature branch [2] so the
>> > >   merge should be a straightforward cherry-pick.
>> > >   - The current version has been rather thoroughly tested and is
>> > >   currently deployed in a *10,000+ node federated YARN cluster that's
>> > >   running upwards of 50k jobs daily with a reliability of 99.9%*.
>> > >   - We have a few ideas for follow-up extensions/improvements which are
>> > >   tracked in the umbrella JIRA YARN-5597[3].
>> > >
>> > >
>> > > Documentation:
>> > >
>> > >   - Quick start guide (maven site) - YARN-6484[4].
>> > >   - Overall design doc[5] and the slide-deck [6] we used for our talk at
>> > >   Hadoop Summit 2016 are available in the umbrella jira - YARN-2915.
>> > >
>> > >
>> > > Credits:
>> > >
>> > > This is a group effort that could not have been possible without the
>> > > ideas and hard work of many other folks and we would like to
>> > > specifically call out Giovanni, Botong & Ellen for their invaluable
>> > > contributions. Also big thanks to the many folks in the community
>> > > (Sriram, Kishore, Sarvesh, Jian, Wangda, Karthik, Vinod, Varun, Inigo,
>> > > Vrushali, Sangjin, Joep, Rohith and many more) that helped us shape our
>> > > ideas and code with very insightful feedback and comments.
>> > >
>> > > Cheers,
>> > > Subru & Carlo
>> > >
>> > > [1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
>> > > [2] https://github.com/apache/hadoop/tree/YARN-2915
>> > > [3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
>> > > [4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
>> > > [5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
>> > > [6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
>> > > [7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
>> > > [8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
>> > > [9] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E
