[
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711274#comment-14711274
]
Varun Saxena commented on YARN-3893:
------------------------------------
Hmm... my point of view is based on the fact that the service cannot be up
unless at least one RM is active. A standby RM is not going to serve anything
anyway.
Until the configuration of this RM is corrected, whether yarn-site or
scheduler configuration, this RM can't become active anyway (refreshAll will
always fail). And you can say there might be some silly mistake in the
scheduler configuration too.
What we were doing before in the patch won't fill up the logs if the
configuration is OK on the other RM. And if it's not OK on the other RM, the
logs will fill up even if refreshAll fails because of something other than the
scheduler config (and fail fast is false).
Fail fast is true by default, and if an admin sets it to false, he will know
what to expect.
But you can say an RM shutting down is a far more alarming thing for an admin,
and scheduler configurations are more important. I agree with that. Maybe we
can keep an RM with a wrong configuration down at all times, because until he
corrects the config (whether yarn-site or scheduler config), this RM can't
become active.
Let us take the opinion of a couple of others as well on this. We can do
whatever is the consensus.
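To make the two options in this comment easier to compare, here is a minimal
sketch of the decision being discussed. It is a toy model, not RM code: the
class, enum and method names are all hypothetical, and the assumption that
"fail fast" means "shut the RM down when the refresh fails" is my reading of
the discussion rather than something taken from the patch.

```java
/** Toy model only; none of these names are Hadoop classes or methods. */
public class RmFailoverSketch {

    /** Hypothetical outcomes for an RM that was asked to become active. */
    enum Outcome { ACTIVE, STANDBY_AND_RETRY, SHUT_DOWN }

    /**
     * Option 1 (assumed behaviour): honour the fail-fast setting.
     *
     * @param refreshOk whether the refresh of all configurations succeeded
     * @param failFast  whether the fail-fast setting is enabled
     */
    static Outcome honourFailFast(boolean refreshOk, boolean failFast) {
        if (refreshOk) {
            return Outcome.ACTIVE;
        }
        // With fail fast disabled the RM stays around and may be asked to
        // become active again, which is where repeated log output comes from.
        return failFast ? Outcome.SHUT_DOWN : Outcome.STANDBY_AND_RETRY;
    }

    /** Option 2 (the alternative floated above): bad config always keeps the RM down. */
    static Outcome alwaysDownOnBadConfig(boolean refreshOk) {
        return refreshOk ? Outcome.ACTIVE : Outcome.SHUT_DOWN;
    }

    public static void main(String[] args) {
        for (boolean failFast : new boolean[] {true, false}) {
            System.out.println("refresh failed, failFast=" + failFast
                    + " -> " + honourFailFast(false, failFast));
        }
        System.out.println("refresh failed, always-down -> "
                + alwaysDownOnBadConfig(false));
    }
}
```

In this model neither option ever leaves an RM whose refresh failed in the
ACTIVE state; the two differ only in whether such an RM keeps running as a
standby or goes down until its configuration is fixed.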
> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>
> Key: YARN-3893
> URL: https://issues.apache.org/jira/browse/YARN-3893
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.7.1
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch,
> 0003-YARN-3893.patch, 0004-YARN-3893.patch, 0005-YARN-3893.patch,
> yarn-site.xml
>
>
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Both RMs will continuously try to become active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
> ./yarn rmadmin -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
> ./yarn rmadmin -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM