Varun Saxena commented on YARN-3893:

We can do the cleanup (i.e. stop active services) when we switch to standby; we 
do this already. Cleanup is also done when we stop the RM. So this shouldn't be 
an issue.

What is happening is as follows:

Let us assume there are two RMs, RM1 and RM2.
When the exception occurs, RM1 waits for RM2 to become active and rejoins 
leader election. As both RMs have the same wrong configuration, RM1 will try to 
become active again (and not switch to standby) after RM2 has tried and failed 
in the same way.
Now, as the problem is in the call to {{refreshAll}}, both RMs will be marked as 
ACTIVE in their respective RM contexts, because we set the state to ACTIVE 
before calling {{refreshAll}}.
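The ordering described above can be illustrated with a minimal sketch. This is not the actual {{AdminService}} code; the class, enum, and exception names here are invented for illustration. The point is only the ordering: the HA state is flipped to ACTIVE before {{refreshAll}} runs, so a refresh failure leaves the context claiming ACTIVE even though the transition never completed.

{code:java}
// Hypothetical sketch of the failure mode; names do not match Hadoop's code.
public class RMTransitionSketch {
    enum HAState { STANDBY, ACTIVE }

    static class RMContext {
        HAState state = HAState.STANDBY;
    }

    static class RefreshFailedException extends Exception {}

    final RMContext context = new RMContext();
    final boolean badConfig;

    RMTransitionSketch(boolean badConfig) {
        this.badConfig = badConfig;
    }

    void refreshAll() throws RefreshFailedException {
        // Stand-in for e.g. a malformed capacity-scheduler.xml or ACL config.
        if (badConfig) {
            throw new RefreshFailedException();
        }
    }

    void transitionToActive() throws RefreshFailedException {
        context.state = HAState.ACTIVE; // state updated first...
        refreshAll();                   // ...then the refresh can still fail
    }

    public static void main(String[] args) {
        RMTransitionSketch rm = new RMTransitionSketch(true);
        try {
            rm.transitionToActive();
        } catch (RefreshFailedException e) {
            // Transition failed, but the context still reports ACTIVE,
            // which is exactly what the UI and getServiceState read.
        }
        System.out.println(rm.context.state);
    }
}
{code}

With a bad configuration on both RMs, both instances end up in this half-transitioned state, which is why {{getServiceState}} reports Active twice.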

*The problem reported here is that the RM is shown as Active when it is not 
actually ACTIVE, i.e. the UI is accessible and getServiceState returns Active 
for both RMs. When we access the UI or get the service state, we check the 
state in the RM context, and that state is ACTIVE.*
So for anyone accessing the RM from the command line or via the UI, the RM is 
active (*because the RM context says so*) when it is not really active. Both 
RMs are just trying incessantly to become active and failing.

That is why I suggested that we can update the RM context. In fact, changing 
the RM context is necessary. We can decide separately when, if at all, to stop 
the active services.

So there are 2 options:
# Set the RM context to STANDBY when the exception occurs and stop the active 
services. The drawback is that we would then have to redo the work of starting 
the active services if this RM were to become ACTIVE later.
# Introduce a new state (say WAITING_FOR_ACTIVE), set it when the exception is 
thrown, and check it so that active services are stopped when switching to 
standby but are not started again when switching back to ACTIVE.

Thoughts, [~sunilg], [~xgong] ?

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: yarn-site.xml
> Cases that can cause this:
> # Capacity scheduler XML is wrongly configured during the switch
> # Refresh ACL failure due to configuration
> # Refresh user-group failure due to configuration
> Both RMs will continuously try to become active:
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UIs are active
> # Status is shown as active for both RMs
