[ 
https://issues.apache.org/jira/browse/YARN-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948751#comment-13948751
 ] 

Vinod Kumar Vavilapalli commented on YARN-1696:
-----------------------------------------------

Tx for the doc, Karthik. Some comments:
 - Like I mentioned, fail-over is a big enough topic in itself, so let's 
split this into two and call this one the ResourceManager fail-over guide. 
We can have a top-level high-availability doc if we want to and link the two 
there.
 - Let's move the state-store and RM restart stuff out.
 - "the applications can resume from their last check-pointed state; e.g. 
completed map tasks in a MapReduce job are not re-run on a subsequent attempt" 
-> This is not related to fail-over. Let's put it in the restart doc.
 - "Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to 
the RMs in a round-robin fashion" -> Or point out that we have 
ConfigFailOverProvider as the default implementation of an abstraction?
 - I think we should mention that even though there are two state-store impls, 
the suggested store is the ZK-based store, for the sake of fencing.
 - We should also document the client retry related configs.
 - Should we give a very basic example configuration of two RMs? The absolute 
minimum required to enable this?
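
For reference, a rough sketch of what such a minimal two-RM configuration
could look like in yarn-site.xml. The hostnames, the cluster id and the ZK
quorum are placeholders made up for illustration, and the property names
assume the current "yarn.resourcemanager.ha." prefix; treat it as a starting
point, not as the final recommended setup.

```xml
<!-- Sketch only: hostnames, cluster id and ZK addresses are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <!-- Logical ids for the two RMs; referenced by the per-RM keys below. -->
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <!-- ZK quorum, assuming ZK is used for the store and for fencing. -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```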

Unrelated to the docs
 - It's late, but after seeing the document, I think we should rename the 
"yarn.resourcemanager.ha." configs to "yarn.resourcemanager.failover.". What 
do others think? Also, "rm-ids" seems weird too.

> Document RM HA
> --------------
>
>                 Key: YARN-1696
>                 URL: https://issues.apache.org/jira/browse/YARN-1696
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-1696.2.patch, yarn-1696-1.patch
>
>
> Add documentation for RM HA. Marking this a blocker for 2.4 as this is 
> required to call RM HA Stable and ready for public consumption. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
