Thanks for the quick response Subru!

Here's my +0, FWIW.

On Mon, Jul 31, 2017 at 3:14 PM, Subru Krishnan <[email protected]> wrote:

> Hi Andrew,
>
> You are raising pertinent questions: one of the key design points of
> Federation was to be completely transparent to applications, i.e. there
> should be no code change or even recompile required to run existing apps in a
> federated cluster. In summary, apps simply get the appearance of a larger
> cluster to play around with. Consequently there are zero public API changes
> (we have new APIs for FederationStateStore but those are purely private)
> for YARN Federation. Additionally, we have backported the code to our
> internal branch (currently based on 2.7.1) and have been running it in
> production at a scale of tens of thousands of nodes.
>
> I agree with you regarding the backport to branch-2. We are planning to
> get that done by August and hence included it in the proposed release
> plan[1] for 2.9.0.
>
> Cheers,
> Subru
>
> [1] https://www.mail-archive.com/[email protected]/msg27126.html
>
>
>
> On Mon, Jul 31, 2017 at 1:29 PM, Andrew Wang <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Sorry for coming to this late, I wasn't on yarn-dev and someone else
>> mentioned that this feature was being merged.
>>
>> With my RM hat on, trunk is an active release branch, so we want to be
>> merging features when they are production-ready. This feature has done one
>> better, and has already been run at 10k-node scale! It's great to see this
>> level of testing and validation for a branch merge.
>>
>> Could one of the contributors comment on compatibility and API stability?
>> It looks like it's compatible and stable, but I wanted to confirm since the
>> target 3.0.0-beta1 release date of mid-September means there isn't much
>> time to do additional development in trunk.
>>
>> Finally, could someone comment on the timeline for merging this into
>> branch-2? Given that the feature seems ready, I expect we'd quickly
>> backport this to branch-2 as well.
>>
>> Best,
>> Andrew
>>
>> On Mon, Jul 31, 2017 at 1:05 PM, Naganarasimha Garla <
>> [email protected]> wrote:
>>
>> > +1, Quite interesting and useful feature. Hoping to see it in 2.9 too.
>> >
>> > On Tue, Aug 1, 2017 at 1:31 AM, Jason Lowe <[email protected]>
>> > wrote:
>> >
>> > > +1
>> > > Jason
>> > >
>> > >
>> > >     On Tuesday, July 25, 2017 10:24 PM, Subru Krishnan <[email protected]>
>> > > wrote:
>> > >
>> > >
>> > >  Hi all,
>> > >
>> > > Per earlier discussion [9], I'd like to start a formal vote to merge
>> > > feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for 7
>> > > days, and will end Aug 1 7PM PDT.
>> > >
>> > > We have been developing the feature in a branch (YARN-2915 [2]) for a
>> > > while, and we are reasonably confident that the state of the feature meets
>> > > the criteria to be merged onto trunk.
>> > >
>> > > *Key Ideas*:
>> > >
>> > > YARN’s centralized design allows strict enforcement of scheduling
>> > > invariants and effective resource sharing, but becomes a scalability
>> > > bottleneck (in number of jobs and nodes) well before reaching the scale of
>> > > our clusters (e.g., 20k-50k nodes).
>> > >
>> > >
>> > > To address these limitations, we developed a scale-out, federation-based
>> > > solution (YARN-2915). Our architecture scales near-linearly to
>> > > datacenter-sized clusters by partitioning nodes across multiple sub-clusters
>> > > (each running a YARN cluster of a few thousand nodes). Applications can span
>> > > multiple sub-clusters *transparently (i.e. no code change or recompilation
>> > > of existing apps)*, thanks to a layer of indirection that negotiates with
>> > > multiple sub-clusters' Resource Managers on behalf of the application.
>> > >
>> > >
>> > > This design is structurally scalable, as it bounds the number of nodes each
>> > > RM is responsible for. Appropriate policies ensure that the majority of
>> > > applications reside within a single sub-cluster, thus further controlling
>> > > the load on each RM. This provides near-linear scale-out by simply adding
>> > > more sub-clusters. The same mechanism enables pooling of resources from
>> > > clusters owned and operated by different teams.
>> > >
>> > > Status:
>> > >
>> > >   - The version we would like to merge to trunk is termed "MVP" (minimal
>> > >   viable product). The feature will have a complete end-to-end application
>> > >   execution flow with the ability to span a single application across
>> > >   multiple YARN (sub) clusters.
>> > >   - There were 50+ sub-tasks that were completed as part of this
>> > >   effort. Every patch has been reviewed and +1ed by a committer. Thanks to
>> > >   Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
>> > >   - Federation is designed to be built around YARN and consequently has
>> > >   minimal code changes to core YARN. The relevant JIRAs that modify existing
>> > >   YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also paid close
>> > >   attention to ensure that if federation is disabled there is zero impact to
>> > >   existing functionality (disabled by default).
>> > >   - We found a few bugs as we went along which we fixed directly upstream
>> > >   in trunk and/or branch-2.
>> > >   - We have been continuously rebasing the feature branch [2] so the merge
>> > >   should be a straightforward cherry-pick.
>> > >   - The current version has been rather thoroughly tested and is currently
>> > >   deployed in a *10,000+ node federated YARN cluster that's running
>> > >   upwards of 50k jobs daily with a reliability of 99.9%*.
>> > >   - We have a few ideas for follow-up extensions/improvements which are
>> > >   tracked in the umbrella JIRA YARN-5597[3].
>> > >
>> > >
>> > > Documentation:
>> > >
>> > >   - Quick start guide (maven site) - YARN-6484[4].
>> > >   - Overall design doc[5] and the slide-deck [6] we used for our talk at
>> > >   Hadoop Summit 2016 are available in the umbrella jira - YARN-2915.
>> > >
>> > >
>> > > Credits:
>> > >
>> > > This is a group effort that would not have been possible without the ideas
>> > > and hard work of many other folks and we would like to specifically call
>> > > out Giovanni, Botong & Ellen for their invaluable contributions. Also big
>> > > thanks to the many folks in the community (Sriram, Kishore, Sarvesh, Jian,
>> > > Wangda, Karthik, Vinod, Varun, Inigo, Vrushali, Sangjin, Joep, Rohith and
>> > > many more) who helped us shape our ideas and code with very insightful
>> > > feedback and comments.
>> > >
>> > > Cheers,
>> > > Subru & Carlo
>> > >
>> > > [1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
>> > > [2] https://github.com/apache/hadoop/tree/YARN-2915
>> > > [3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
>> > > [4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
>> > > [5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
>> > > [6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
>> > > [7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
>> > > [8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
>> > > [9] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E
>> > >
>> > >
>> > >
>> >
>>
>
>
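
For anyone skimming the thread, the "Key Ideas" in the quoted proposal (bound the number of nodes per RM by partitioning them into sub-clusters, and use a policy that gives most applications a single home sub-cluster) can be illustrated with a toy sketch. This is not the YARN Federation code or its policy API. The function names, the 4000-node bound, and the hash-based placement are all assumptions made up for illustration.

```python
import hashlib
from typing import Dict, List

# Assumed bound for illustration; the proposal only says "a few thousand nodes".
MAX_NODES_PER_SUBCLUSTER = 4000


def partition_nodes(nodes: List[str],
                    max_per_subcluster: int = MAX_NODES_PER_SUBCLUSTER
                    ) -> Dict[str, List[str]]:
    """Split the node list into sub-clusters no larger than the bound."""
    return {
        f"sc-{start // max_per_subcluster}": nodes[start:start + max_per_subcluster]
        for start in range(0, len(nodes), max_per_subcluster)
    }


def home_subcluster(app_id: str, subclusters: Dict[str, List[str]]) -> str:
    """Hypothetical policy: hash the application id to one home sub-cluster.

    The same app id always maps to the same sub-cluster, so an application
    that fits in one sub-cluster only ever loads that sub-cluster's RM.
    """
    ids = sorted(subclusters)
    digest = hashlib.sha256(app_id.encode("utf-8")).digest()
    return ids[int.from_bytes(digest[:8], "big") % len(ids)]


nodes = [f"node-{n}" for n in range(10_000)]
subclusters = partition_nodes(nodes)
print({name: len(members) for name, members in subclusters.items()})
# prints {'sc-0': 4000, 'sc-1': 4000, 'sc-2': 2000}
print(home_subcluster("application_0001", subclusters))
```

Adding capacity in this sketch means adding another sub-cluster rather than growing any single one, which is the scale-out property the proposal describes. A real policy would presumably weigh load, locality and ownership rather than a plain hash.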
