[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

Varun Saxena (JIRA) Tue, 25 Aug 2015 06:45:18 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711274#comment-14711274
 ]


Varun Saxena commented on YARN-3893:
------------------------------------

Hmm... my point of view is based on the fact that the service cannot be up 
unless at least one RM is active. A standby RM is not going to serve anything 
anyway. 
Until the configurations of this RM are corrected, whether yarn-site or 
scheduler configurations, this RM can't become active anyway (refreshAll will 
always fail). And you could say there might be some silly mistake in the 
scheduler configuration too.

What we were doing before in the patch won't fill up the logs if the 
configuration is OK on the other RM. And if it's not OK on the other RM, the 
logs will fill up even if refreshAll fails because of something other than the 
scheduler config (and fail fast is false).
Fail fast is true by default, and if the admin sets it to false, he will know 
what to expect. 
 
But you could say that an RM shutting down is a far more alarming thing for an 
admin, and that scheduler configurations are more important. I agree with that. 
Maybe we can keep an RM with a wrong configuration down at all times, because 
until he corrects the config (whether yarn-site or scheduler config), this RM 
can't become active.
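
To make the two options concrete, here is a rough sketch of the behaviour being 
discussed. It is not the code from the patch: SketchRm, HaState and attempt() 
are hypothetical names, and the only thing it models is the choice between 
shutting down (fail fast true) and falling back to standby (fail fast false) 
when refreshAll fails after the transition to active.

```java
public class RefreshFailureSketch {

    // Hypothetical HA states for this sketch; not the real Hadoop enum.
    enum HaState { STANDBY, ACTIVE, STOPPED }

    // Hypothetical stand-in for a ResourceManager; not the real Hadoop class.
    static final class SketchRm {
        private final boolean failFast;
        private final boolean configValid;
        private HaState state = HaState.STANDBY;

        SketchRm(boolean failFast, boolean configValid) {
            this.failFast = failFast;
            this.configValid = configValid;
        }

        HaState state() {
            return state;
        }

        // Stands in for refreshAll(): fails while the configuration is wrong.
        void refreshAll() {
            if (!configValid) {
                throw new IllegalStateException("invalid configuration");
            }
        }

        // Become active, then refresh. On a refresh failure the RM must not
        // stay active: it stops (fail fast) or goes back to standby.
        void transitionToActive() {
            if (state == HaState.STOPPED) {
                return; // a stopped RM stays down until it is restarted
            }
            state = HaState.ACTIVE;
            try {
                refreshAll();
            } catch (IllegalStateException e) {
                state = failFast ? HaState.STOPPED : HaState.STANDBY;
            }
        }
    }

    // Runs one transition attempt and reports the resulting state.
    static HaState attempt(boolean failFast, boolean configValid) {
        SketchRm rm = new SketchRm(failFast, configValid);
        rm.transitionToActive();
        return rm.state();
    }

    public static void main(String[] args) {
        System.out.println("failFast=true,  bad config:  " + attempt(true, false));
        System.out.println("failFast=false, bad config:  " + attempt(false, false));
        System.out.println("failFast=true,  good config: " + attempt(true, true));
    }
}
```

In neither failure branch does the RM remain ACTIVE, which is the property the 
issue is about. With fail fast false the RM stays in standby and can be asked 
to become active again, which is where the repeated log entries would come from.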

Let us take the opinion of a couple of others on this as well. We can go with 
whatever the consensus is.

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
> 0003-YARN-3893.patch, 0004-YARN-3893.patch, 0005-YARN-3893.patch, 
> yarn-site.xml
>
>
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Both RMs will continuously try to become active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
