[ https://issues.apache.org/jira/browse/YARN-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232144#comment-14232144 ]
Carlo Curino commented on YARN-2915:
------------------------------------

A few design principles at play:
# We are designing federation so that it requires minimal changes to YARN.
# We are trying hard to make federation completely transparent to applications.
# We are investigating uses of federation that could facilitate maintenance / fault-tolerance / sub-cluster customization.

Regarding (1), in the context of the cluster pooling / private cloud idea mentioned above, the clusters being pooled can be (largely) unaware that they are being federated together, as all/most public protocols are unmodified, and the AMRMProxy of YARN-2884 can be run on only a small cluster that works as a launch pad for the federation.

Regarding (2), it seems plausible (we have a working prototype) to make the federation transparent to the applications, but more analysis of security, load balancing, and HA aspects is required.

Regarding (3), federation should facilitate upgrades of each sub-cluster, can be made more fault-tolerant by having the routing layer fall back on a secondary cluster upon cluster-wide failures, and could be leveraged for customization (e.g., run a smaller cluster with very fast heartbeats and a bigger cluster with slower heartbeats, and pull them together on demand).

> Enable YARN RM scale out via federation using multiple RM's
> -----------------------------------------------------------
>
>                 Key: YARN-2915
>                 URL: https://issues.apache.org/jira/browse/YARN-2915
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Sriram Rao
>            Assignee: Subru Krishnan
>
> This is an umbrella JIRA that proposes to scale out YARN to support large
> clusters comprising of tens of thousands of nodes. That is, rather than
> limiting a YARN managed cluster to about 4k in size, the proposal is to
> enable the YARN managed cluster to be elastically scalable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)