[
https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948751#comment-13948751
]
Vinod Kumar Vavilapalli commented on YARN-1696:
-----------------------------------------------
Tx for the doc, Karthik. Some comments:
- Like I mentioned, fail-over is a big enough topic in itself, so let's
split this into two and call this one the ResourceManager fail-over guide.
We can have a top level high-availability doc if we want to and link the two
there.
- Let's move the state-store and RM restart stuff out.
- "the applications can resume from their last check-pointed state; e.g.
completed map tasks in a MapReduce job are not re-run on a subsequent attempt"
-> This is not related to fail-over. Let's put it in the restart doc.
- "Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to
the RMs in a round-robin fashion" -> Or point out that we have
ConfigFailOverProvider as the default implementation of an abstraction?
- I think we should mention that even though there are two state-store impls,
the suggested store is the ZK-based store for the sake of fencing.
- We should also document the client retry related configs.
- Should we give a very basic example configuration of two RMs? The absolute
minimum required to enable this?
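A minimal sketch of what such an example might look like in yarn-site.xml is
below. It assumes the current "yarn.resourcemanager.ha." property naming; the
ids (rm1, rm2), the cluster id, and all host names are placeholders made up
for illustration, not values taken from the doc or the patch.

```xml
<configuration>
  <!-- Turn on RM HA (fail-over between multiple RMs). -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- Placeholder cluster id shared by both RMs. -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
  </property>
  <!-- Logical ids of the two RMs; the names are arbitrary. -->
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <!-- Host of each RM, keyed by its logical id (placeholder hosts). -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <!-- ZooKeeper quorum (placeholder hosts). -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>
```

Whether this really is the absolute minimum should be checked against the
patch; the state-store and client retry settings are left out here on purpose.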
Unrelated to the docs
- It's late, but after seeing the document, I think we should rename
"yarn.resourcemanager.ha." configs to be "yarn.resourcemanager.failover.". What
do others think? Also, "rm-ids" seems weird too.
> Document RM HA
> --------------
>
> Key: YARN-1696
> URL: https://issues.apache.org/jira/browse/YARN-1696
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.3.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Blocker
> Attachments: YARN-1696.2.patch, yarn-1696-1.patch
>
>
> Add documentation for RM HA. Marking this a blocker for 2.4 as this is
> required to call RM HA Stable and ready for public consumption.