+1.

Thanks for delivering this important piece of work! I look forward to
seeing it in the trunk.

Regards,
Sangjin

On Thu, Jul 27, 2017 at 2:49 PM, Botong Huang <[email protected]> wrote:

> +1 (non-binding)
>
> We have just deployed the latest bits (62f1ce2a3d9) from YARN-2915 in our
> test cluster and ran multiple jobs. We confirm that Federation is working
> e2e!
>
> Our cluster setup: eight sub-clusters, each with one RM and four NM nodes.
> One Router machine. SQL Server on Ubuntu is used as the FederationStateStore.
>
> Cheers,
>
> Botong
>
> On Thu, Jul 27, 2017 at 2:30 PM, Carlo Aldo Curino <[email protected]> wrote:
>
> > +1
> >
> > Cheers,
> > Carlo
> >
> > On Thu, Jul 27, 2017 at 12:45 PM, Arun Suresh <[email protected]> wrote:
> >
> > > +1
> > >
> > > Cheers
> > > -Arun
> > >
> > > On Jul 25, 2017 8:24 PM, "Subru Krishnan" <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Per earlier discussion [9], I'd like to start a formal vote to merge
> > >> feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for
> > >> 7 days, and will end Aug 1 7PM PDT.
> > >>
> > >> We have been developing the feature in a branch (YARN-2915 [2]) for a
> > >> while, and we are reasonably confident that the state of the feature
> > >> meets the criteria to be merged onto trunk.
> > >>
> > >> *Key Ideas*:
> > >>
> > >> YARN’s centralized design allows strict enforcement of scheduling
> > >> invariants and effective resource sharing, but becomes a scalability
> > >> bottleneck (in number of jobs and nodes) well before reaching the
> > >> scale of our clusters (e.g., 20k-50k nodes).
> > >>
> > >>
> > >> To address these limitations, we developed a scale-out,
> > >> federation-based solution (YARN-2915). Our architecture scales
> > >> near-linearly to datacenter-sized clusters by partitioning nodes across
> > >> multiple sub-clusters (each running a YARN cluster of a few thousand
> > >> nodes). Applications can span multiple sub-clusters *transparently (i.e.,
> > >> no code change or recompilation of existing apps)*, thanks to a layer of
> > >> indirection that negotiates with multiple sub-clusters' Resource Managers
> > >> on behalf of the application.
> > >>
> > >>
> > >> This design is structurally scalable, as it bounds the number of nodes
> > >> each RM is responsible for. Appropriate policies ensure that the majority
> > >> of applications reside within a single sub-cluster, thus further
> > >> controlling the load on each RM. This provides near-linear scale-out by
> > >> simply adding more sub-clusters. The same mechanism enables pooling of
> > >> resources from clusters owned and operated by different teams.
> > >>
> > >> Status:
> > >>
> > >>    - The version we would like to merge to trunk is termed "MVP"
> > >>    (minimal viable product). The feature will have a complete end-to-end
> > >>    application execution flow with the ability to span a single
> > >>    application across multiple YARN (sub) clusters.
> > >>    - There were 50+ sub-tasks that were completed as part of this
> > >>    effort. Every patch has been reviewed and +1ed by a committer. Thanks
> > >>    to Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
> > >>    - Federation is designed to be built around YARN and consequently has
> > >>    minimal code changes to core YARN. The relevant JIRAs that modify the
> > >>    existing YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also paid
> > >>    close attention to ensure that if federation is disabled there is zero
> > >>    impact to existing functionality (it is disabled by default).
> > >>    - We found a few bugs as we went along which we fixed directly
> > >>    upstream in trunk and/or branch-2.
> > >>    - We have been continuously rebasing the feature branch [2] so the
> > >>    merge should be a straightforward cherry-pick.
> > >>    - The current version has been rather thoroughly tested and is
> > >>    currently deployed in a *10,000+ node federated YARN cluster that's
> > >>    running upwards of 50k jobs daily with a reliability of 99.9%*.
> > >>    - We have a few ideas for follow-up extensions/improvements which are
> > >>    tracked in the umbrella JIRA YARN-5597 [3].
> > >>
> > >>
> > >> Documentation:
> > >>
> > >>    - Quick start guide (maven site) - YARN-6484 [4].
> > >>    - The overall design doc [5] and the slide-deck [6] we used for our
> > >>    talk at Hadoop Summit 2016 are available in the umbrella JIRA -
> > >>    YARN-2915.
> > >>
> > >>
> > >> Credits:
> > >>
> > >> This is a group effort that would not have been possible without the
> > >> ideas and hard work of many other folks, and we would like to
> > >> specifically call out Giovanni, Botong & Ellen for their invaluable
> > >> contributions. Also, big thanks to the many folks in the community
> > >> (Sriram, Kishore, Sarvesh, Jian, Wangda, Karthik, Vinod, Varun, Inigo,
> > >> Vrushali, Sangjin, Joep, Rohith and many more) who helped us shape our
> > >> ideas and code with very insightful feedback and comments.
> > >>
> > >> Cheers,
> > >> Subru & Carlo
> > >>
> > >> [1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
> > >> [2] https://github.com/apache/hadoop/tree/YARN-2915
> > >> [3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
> > >> [4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
> > >> [5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
> > >> [6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
> > >> [7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
> > >> [8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
> > >> [9] http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E
> > >>
> > >
> >
>
