Varun Saxena commented on YARN-3893:

Thanks for the patch [~bibinchundatt]. A few comments.

# Nit: the message should read "Exception in state transition" (typo in the patch):
          throw new ServiceFailedException(
              "Exception in state transistion", re);
# IMO, there is no need to throw ServiceFailedException when an exception is 
caught while calling reinitialize. The throw below should suffice; just set the 
flag. I think we should retain the original exception.
# Add a comment indicating what the flag does.
# Maybe rename the flag to reinitActiveServices instead of reinitialize.
# Semantically, the flag doesn't quite belong in AdminService. It could live in 
ResourceManager or RMContext. Thoughts?
# Can you add a test to verify the fix?
# I think that instead of relying on transitionToStandby to change the state to 
standby, we can explicitly change the state in AdminService. That's because even 
stopActiveServices can throw an exception, and if it does, the state won't 
change to STANDBY. This call to stop should not throw an exception, but as 
services keep getting added you never know how a particular service may behave. 
We should be immune to it. Try something like below.
# Just a suggestion: if we do the above, maybe call stopActiveServices and 
reinitialize directly instead of calling transitionToStandby. As I said in a 
comment above, transitionToStandby would print an audit log saying the 
transition is successful, but the subsequent reinitialize may fail. Not 
printing this audit log would also be consistent with transitionToActive 
failing while starting active services. Thoughts?
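
To illustrate the last two points, here is a rough sketch using toy stand-in classes, not the real ResourceManager/AdminService code. ToyRm, the refreshAll argument, the auditLog list and the local ServiceFailedException are all hypothetical names introduced only for this example. The idea is to set the state to STANDBY and set the flag before calling stopActiveServices, so that a failure while stopping cannot leave the RM reporting ACTIVE, and to keep the original exception as the cause.

```java
import java.util.ArrayList;
import java.util.List;

public class TransitionSketch {
  enum HAState { ACTIVE, STANDBY }

  /** Local stand-in for the real exception type, only so the sketch compiles alone. */
  static class ServiceFailedException extends Exception {
    ServiceFailedException(String msg, Throwable cause) { super(msg, cause); }
  }

  /** Hypothetical toy RM; it only models the state, the flag and an audit log. */
  static class ToyRm {
    HAState state = HAState.STANDBY;
    // Set when active services must be re-created before the next transition.
    boolean reinitActiveServices = false;
    final List<String> auditLog = new ArrayList<>();
    final boolean stopThrows;

    ToyRm(boolean stopThrows) { this.stopThrows = stopThrows; }

    void startActiveServices() { state = HAState.ACTIVE; }

    void stopActiveServices() {
      if (stopThrows) {
        throw new IllegalStateException("a service failed to stop");
      }
    }
  }

  /** refreshAll is passed in so that a refresh failure can be simulated. */
  static void transitionToActive(ToyRm rm, Runnable refreshAll)
      throws ServiceFailedException {
    rm.startActiveServices();
    try {
      refreshAll.run();
    } catch (RuntimeException e) {
      // Change state and set the flag first, so a failing stop cannot undo it.
      rm.state = HAState.STANDBY;
      rm.reinitActiveServices = true;
      try {
        rm.stopActiveServices();
      } catch (RuntimeException stopFailure) {
        // Record the stop failure without losing the original exception.
        e.addSuppressed(stopFailure);
      }
      // No "success" audit entry is written on this path.
      throw new ServiceFailedException("Exception in state transition", e);
    }
    rm.auditLog.add("transitionToActive: SUCCESS");
  }

  public static void main(String[] args) {
    ToyRm rm = new ToyRm(true); // stopActiveServices will throw
    try {
      transitionToActive(rm, () -> {
        throw new IllegalArgumentException("bad scheduler configuration");
      });
    } catch (ServiceFailedException e) {
      System.out.println("state=" + rm.state + ", cause=" + e.getCause());
    }
  }
}
```

In this sketch the state change does not depend on stopActiveServices completing, and the success audit entry is only written when refreshAll succeeds.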

> Both RM in active state when Admin#transitionToActive failure from refreshAll()
> -------------------------------------------------------------------------------
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
> 0003-YARN-3893.patch, yarn-site.xml
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Both RMs will continuously try to become active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM
