[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311771#comment-14311771
 ] 

Xuan Gong commented on YARN-3152:
---------------------------------

Thanks for the comment, [~Naganarasimha]. This is a very good suggestion.

bq. currently there is different behavior in HA mode(starts of properly) and 
non HA mode

So, in non-HA mode, when RM starts, it just starts all the services.
But in HA mode, when RM transits to Active, it has two step process:
* start all the services which is the same as in non-HA mode and it works fine.
* call refreshAll() to refresh all the configuration, including refreshQueues, 
refreshUserToGroupInformation, refreshNodes, etc. At this point, if we 
configure the file path but the file does not exist, it will throw out the 
exception.

When we call any refresh, if there is any problem, I think that we need to 
throw out the exception instead of silence ignore it. 

> Missing hadoop exclude file fails RMs in HA
> -------------------------------------------
>
>                 Key: YARN-3152
>                 URL: https://issues.apache.org/jira/browse/YARN-3152
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: Debian 7
>            Reporter: Neill Lima
>            Assignee: Naganarasimha G R
>
> NI have two NNs in HA, they do not fail when the exclude file is not present 
> (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in 
> HA. I didn't create the exclude file at this point as well. I applied the HA 
> RM settings properly and when I started both RMs I started getting this 
> exception:
> 2015-02-06 12:25:25,326 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root   
> OPERATION=transitionToActive    TARGET=RMHAProtocolService      
> RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   
> PERMISSIONS=All users are allowed
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
>       at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
>       at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
>       at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>       ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: 
> java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file 
> or directory)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
>       ... 5 more
> 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Trying to re-establish ZK session
> 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x44af32566180094 closed
> 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating 
> client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 
> sessionTimeout=10000 
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate 
> using SASL (unknown error)
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to x.x.x.x/x.x.x.x:2181, initiating session
> The issue is descriptive enough to resolve the problem - and it has been 
> fixed by creating the exclude file. 
> I just think as of a improvement: 
> - Should RMs ignore the missing file as the NNs did?
> - Should single RM fail even when the file is not present?
> Just suggesting this improvement to keep the behavior consistent when working 
> with in HA (both NNs and RMs). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to