[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310921#comment-14310921 ]
Xuan Gong commented on YARN-3152:
---------------------------------

bq. NodesListManager.refreshNodes should be same as NodesListManager.serviceInit where in they handle gracefully if the configured paths for exclude/include doesn't exist.

NodesListManager.refreshNodes already works the same way as NodesListManager.serviceInit.
{code}
// NodesListManager.serviceInit:
HostsFileReader hostsReader =
    new HostsFileReader(
        includesFile,
        (includesFile == null || includesFile.isEmpty()) ? null
            : this.rmContext.getConfigurationProvider()
                .getConfigurationInputStream(this.conf, includesFile),
        excludesFile,
        (excludesFile == null || excludesFile.isEmpty()) ? null
            : this.rmContext.getConfigurationProvider()
                .getConfigurationInputStream(this.conf, excludesFile));
{code}
If a configured file does not exist, both of them will throw an exception, no?

I understand your concern. But I think the earlier we find the issue, the better; otherwise it may be hard to debug why the excluded nodes are not taken into account even though an exclude-node list was provided. So we throw this exception when the active RM starts.

> Missing hadoop exclude file fails RMs in HA
> -------------------------------------------
>
>                 Key: YARN-3152
>                 URL: https://issues.apache.org/jira/browse/YARN-3152
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: Debian 7
>            Reporter: Neill Lima
>            Assignee: Naganarasimha G R
>
> I have two NNs in HA; they do not fail when the exclude file is not present
> (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in
> HA. I didn't create the exclude file at this point either. I applied the HA
> RM settings properly, and when I started both RMs I started getting this
> exception:
>
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed
> 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
>     ... 5 more
> 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed
> 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session
>
> The issue is descriptive enough to resolve the problem - and it has been
> fixed by creating the exclude file.
> I am just suggesting this as an improvement:
> - Should RMs ignore the missing file as the NNs did?
> - Should a single RM fail even when the file is not present?
> Just suggesting this improvement to keep the behavior consistent when working
> in HA (both NNs and RMs).
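
For reference, the two behaviours discussed in this thread (fail fast on a missing exclude file, as the RM does here, versus treating a missing file as an empty list, as the reporter describes for the NNs) can be sketched roughly as follows. This is only an illustration: HostsFilePolicy, readStrict and readLenient are hypothetical names, not Hadoop classes or methods, and the one-host-per-line parsing is an assumption.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of Hadoop): two policies for a missing hosts file. */
final class HostsFilePolicy {

    /** Fail fast: a configured path that does not exist is reported as an error. */
    static Set<String> readStrict(String path) throws IOException {
        if (path == null || path.isEmpty()) {
            return Collections.emptySet(); // nothing configured, nothing to read
        }
        Path p = Paths.get(path);
        if (!Files.exists(p)) {
            throw new FileNotFoundException(path + " (No such file or directory)");
        }
        return parse(p);
    }

    /** Lenient: a configured path that does not exist is treated as an empty list. */
    static Set<String> readLenient(String path) throws IOException {
        if (path == null || path.isEmpty() || !Files.exists(Paths.get(path))) {
            return Collections.emptySet();
        }
        return parse(Paths.get(path));
    }

    /** Assumed format: one host name per line; blank lines and '#' comments are skipped. */
    private static Set<String> parse(Path p) throws IOException {
        Set<String> hosts = new HashSet<>();
        for (String line : Files.readAllLines(p)) {
            String host = line.trim();
            if (!host.isEmpty() && !host.startsWith("#")) {
                hosts.add(host);
            }
        }
        return hosts;
    }
}
```

The strict variant surfaces a misconfigured path as soon as the file is read, which is the argument made in the comment above; the lenient variant keeps the service starting but can hide the fact that an intended exclude list is not being applied.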