[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

Varun Saxena (JIRA) Tue, 25 Aug 2015 03:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711036#comment-14711036
 ]


Varun Saxena commented on YARN-3893:
------------------------------------

A few additional comments:

* In the exception block below (the one after the call to refreshAll), if 
{{YarnConfiguration.shouldRMFailFast(getConfig())}} is true, we merely post a 
fatal event and neither return nor throw an exception. Control then falls 
through to the success audit log for transition to active, which does not look 
correct, because we hit a problem during the transition. We should either 
return or throw a ServiceFailedException here as well. Both are OK, since the 
RM will go down later anyway, but I would prefer the exception. 
{code} 
        } catch (Exception e) {
          if (isRMActive() && YarnConfiguration.shouldRMFailFast(getConfig())) {
            rmContext.getDispatcher().getEventHandler()
                .handle(new RMFatalEvent(RMFatalEventType.ACTIVE_REFRESH_FAIL, e));
          }else{
            rm.handleTransitionToStandBy();
            throw new ServiceFailedException(
                "Error on refreshAll during transistion to Active", e);
          }
        }
        RMAuditLogger.logSuccess(user.getShortUserName(), "transitionToActive",
            "RMHAProtocolService");
      }
{code}

* In TestRMHA, the import below is unused. 
{code}
        import io.netty.channel.MessageSizeEstimator.Handle;
{code}

* A nit: there should be a space before {{else}}.
{code}
          }else{
            rm.handleTransitionToStandBy();
{code}

* In the added test, the assert in the exception block after the first call to 
transitionToActive is not required.

* Maybe we can add an assert in the test that the service state is STANDBY 
after a call to transitionToActive with an incorrect capacity scheduler config 
and fail-fast set to false.
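
To make the first and last points concrete, here is a minimal, self-contained sketch of the control flow I have in mind. It is not the real AdminService code: {{TransitionSketch}}, its fields, and the local {{ServiceFailedException}} are hypothetical stand-ins so that the example compiles without Hadoop on the classpath. The only point it illustrates is that the exception is thrown on both branches, so the success audit entry is never written after a failed refreshAll, and that with fail-fast false the state ends up STANDBY.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.hadoop.ha.ServiceFailedException.
class ServiceFailedException extends Exception {
    ServiceFailedException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical model of the transition logic; not the actual AdminService.
class TransitionSketch {
    enum HAState { ACTIVE, STANDBY }

    HAState state = HAState.STANDBY;
    final List<String> auditLog = new ArrayList<>();     // models RMAuditLogger
    final List<String> fatalEvents = new ArrayList<>();  // models the dispatcher

    private final boolean failFast;      // models shouldRMFailFast(getConfig())
    private final boolean refreshFails;  // models refreshAll() throwing

    TransitionSketch(boolean failFast, boolean refreshFails) {
        this.failFast = failFast;
        this.refreshFails = refreshFails;
    }

    private void refreshAll() throws Exception {
        if (refreshFails) {
            throw new IllegalStateException("bad capacity scheduler config");
        }
    }

    void transitionToActive() throws ServiceFailedException {
        state = HAState.ACTIVE;
        try {
            refreshAll();
        } catch (Exception e) {
            if (state == HAState.ACTIVE && failFast) {
                // Models posting RMFatalEvent(ACTIVE_REFRESH_FAIL, e).
                fatalEvents.add("ACTIVE_REFRESH_FAIL");
            } else {
                // Models rm.handleTransitionToStandBy().
                state = HAState.STANDBY;
            }
            // Thrown on both branches, so the success audit below is skipped.
            throw new ServiceFailedException(
                "Error on refreshAll during transition to Active", e);
        }
        auditLog.add("transitionToActive SUCCESS");
    }

    public static void main(String[] args) {
        TransitionSketch rm = new TransitionSketch(false, true);
        try {
            rm.transitionToActive();
        } catch (ServiceFailedException e) {
            System.out.println("transition failed: " + e.getMessage());
        }
        System.out.println("state=" + rm.state + ", audit=" + rm.auditLog);
    }
}
```

A test for the non-fail-fast case would then assert both that the call throws and that the state afterwards is STANDBY, which is the extra assert suggested in the last point above.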

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
> 0003-YARN-3893.patch, 0004-YARN-3893.patch, 0005-YARN-3893.patch, 
> yarn-site.xml
>
>
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Continuously both RM will try to be active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin>
>  ./yarn rmadmin  -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
