[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627788#comment-14627788 ] Naganarasimha G R commented on YARN-3152: - Thanks [~neillfontes] for responding, Was waiting for some feedback on this from you and committers/PMC so that i can go ahead. Services will be not available thats for sure but another issue is from the logs its visible that even though RM start up has failed, another RM tries to come up and if it fails for the same reason it just keeps oscillating between the two... As you mentioned i feel WARN should be sufficent for this issue or we can adopt the approach specified by [~kasha] in YARN-3607 and change the behavior to log WARN by default and if {{yarn.fail-fast}} is specified to true then have the current behavior. thoughts? Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626448#comment-14626448 ] Neill Lima commented on YARN-3152: -- Hello [~Naganarasimha], I am visiting this topic since it is been a while, sorry about the delay. | When you mention RMs didn't start you mean you were not able to access web ui or the process down ? The RM didn't bootstrap so I couldn't even see the web ui or connect to the server. If the excluded file is not that relevant the absence of it should not block the RM to go up (that is a way higher priority). A [WARN] could be added to the logs though, just like the NNs do. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316294#comment-14316294 ] Neill Lima commented on YARN-3152: -- [~vinodkv]] -- It fails in both RMs indeed. What was 'unexpected' it didn't fail in single RM because of the missing exclude file. Is the need of the exclude file so relevant to the RMs but not so much for the NNs? Because the behavior (NNs vs RMs) is very different. I lean towards the NNs behavior. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316887#comment-14316887 ] Naganarasimha G R commented on YARN-3152: - [~vinodkv], [~xgong], [~neillfontes] [~rohithsharma] From the discussions till now what i could conclude is : As per the design if the required files are not there we need to fail fast, i.e. in case of Non HA cluster we should throw exception RM should fail to start . And in case of HA, transition to active should fail and none of the services should be active on failure. And as part of this jira we need to achieve this. Please inform if this approach is fine or needs more discussion on this. [~neillfontes], Hope you got to test with the steps which i mentioned in my earlier [comment|https://issues.apache.org/jira/browse/YARN-3152?focusedCommentId=14313875page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14313875]. Seems like you were able to see the same behavior what i mentioned in step 2 but wanted to know more about step3 where in i see the actives services getting oscillated between the 2 RM servers. Is it the same behavior as i mentioned or i am missing something. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312711#comment-14312711 ] Vinod Kumar Vavilapalli commented on YARN-3152: --- Per [~xgong]'s comment [above|https://issues.apache.org/jira/browse/YARN-3152?focusedCommentId=14310921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14310921], it seems that the RM fails if the configured exclude file is absent in both HA and non HA cases, at startup as well as doing failover. If that is correct, please fix the title of the JIRA. bq. Assume there is 2 RM HA cluster and active RM is made down (manually or for any other reason) and standby RM is not having the configured exclude file [..] This is a general problem with two HA nodes being in sync w.r.t configuration. YARN-1666 and friends address this issue via a different solution that may not have been well documented. With that, you'd not even put the exclude files on local host of each of the RMs. Will that work for you? Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313309#comment-14313309 ] Rohith commented on YARN-3152: -- And not related to this Jira but similar, I see there is potential issue in {{AdminService.transitionToActive()}}. If there is any exception at {{rm.transitionToActive()}} then RM will be in standby, but if any exception at refreshAll() of any configuration like scheduler configuration has some problem, this case RM will be in active but elector is always trying to make it bring active in single node cluster with HA. I think RM should move explicitly to transitionToStandby just to notify admin or user that look into logs to identify configuration has problem. {code} try { rm.transitionToActive(); // call all refresh*s for active RM to get the updated configurations. refreshAll(); RMAuditLogger.logSuccess(user.getShortUserName(), transitionToActive, RMHAProtocolService); } catch (Exception e) { {code} Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312911#comment-14312911 ] Xuan Gong commented on YARN-3152: - bq. Per Xuan Gong's comment above, it seems that the RM fails if the configured exclude file is absent in both HA and non HA cases, at startup as well as doing failover. If that is correct, please fix the title of the JIRA. When startup in both HA and non-HA, if the configured exclude file is absent, the RM does not fail. Because we explicitly catch the exception. {code} try { this.includesFile = conf.get(YarnConfiguration.RM_NODES_INCLUDE_FILE_PATH, YarnConfiguration.DEFAULT_RM_NODES_INCLUDE_FILE_PATH); this.excludesFile = conf.get(YarnConfiguration.RM_NODES_EXCLUDE_FILE_PATH, YarnConfiguration.DEFAULT_RM_NODES_EXCLUDE_FILE_PATH); this.hostsReader = createHostsFileReader(this.includesFile, this.excludesFile); setDecomissionedNMsMetrics(); printConfiguredHosts(); } catch (YarnException ex) { disableHostsFileReader(ex); } catch (IOException ioe) { disableHostsFileReader(ioe); } {code} It only fails in fail-over case. After we start all the related service, we automatically call refresh* (refresh queue, nodelist, userToGroupMapping, etc). At this time, if the configured exclude file is absent, the refreshNodeList call will throw out the exception which cause the RM fails. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312932#comment-14312932 ] Neill Lima commented on YARN-3152: -- [~xgong] got it correctly. Although in my case it was slightly different: I had a single RM and I wanted to make dual HA RM. I did not had the exclude file. It haven't failed so far. When I transitioned to HA mode, the RMs didn't start because of the missing file. That's what I said it was an inconsistent behavior. The NNs handled it more gracefully and they don't fail, they just logged a [WARN] in the logs. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313081#comment-14313081 ] Vinod Kumar Vavilapalli commented on YARN-3152: --- bq. It only fails in fail-over case. Okay, we should try to be consistent then. Fail it in both places assuming that YARN-1666 and friends address the general problem with two HA nodes being in sync w.r.t configuration? Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311817#comment-14311817 ] Rohith commented on YARN-3152: -- I have some doubts on the behaviour , please help to understand it. bq. When we call any refresh, if there is any problem, I think that we need to throw out the exception instead of silence ignore it. Make sense via Admin operations i.e admin explicitely invoking refresh commands. But why do we ignore during RM start up? And consider only during transitionToActive? Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311771#comment-14311771 ] Xuan Gong commented on YARN-3152: - Thanks for the comment, [~Naganarasimha]. This is a very good suggestion. bq. currently there is different behavior in HA mode(starts of properly) and non HA mode So, in non-HA mode, when RM starts, it just starts all the services. But in HA mode, when RM transits to Active, it has two step process: * start all the services which is the same as in non-HA mode and it works fine. * call refreshAll() to refresh all the configuration, including refreshQueues, refreshUserToGroupInformation, refreshNodes, etc. At this point, if we configure the file path but the file does not exist, it will throw out the exception. When we call any refresh, if there is any problem, I think that we need to throw out the exception instead of silence ignore it. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310921#comment-14310921 ] Xuan Gong commented on YARN-3152: - bq. NodesListManager.refreshNodes should be same as NodesListManager.serviceInit where in they handle gracefully if the configured paths for exclude/include doesn't exist. NodesListManager.refreshNodes is using the same way as NodesListManager.serviceInit. {code} NodeListManager.serviceInit: HostsFileReader hostsReader = new HostsFileReader(includesFile, (includesFile == null || includesFile.isEmpty()) ? null : this.rmContext.getConfigurationProvider() .getConfigurationInputStream(this.conf, includesFile), excludesFile, (excludesFile == null || excludesFile.isEmpty()) ? null : this.rmContext.getConfigurationProvider() .getConfigurationInputStream(this.conf, excludesFile)); {code} If the file does not exist, both of them will throw out the exception. No ? I understand what you consider. But I think that the earlier we found the issue (In our case, maybe hard to debug why the exclude nodes are not considered even we provides the exclude-node-list ), the better. So, we throw out such exception when active RM starts Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311058#comment-14311058 ] Naganarasimha G R commented on YARN-3152: - Hi [~xgong], bq. If the file does not exist, both of them will throw out the exception. No ? Yes you are right, intention is to not throw the exception, but may be can log WARN message saying Configured file doesn't exist (with the path info). bq. So, we. throw out such exception when active RM start I see currently there is different behavior in HA mode(starts of properly) and non HA mode(starts of but continuous logs saying file not found) which i feel should not be the behavior Also in other places where we configure the file path we do check whether file exists (ex. nodeHealthScripts), so i was of the opinion that its better to add a file exists check here too or atleast the behavior for HA and Non HA mode should be same. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA
[ https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310487#comment-14310487 ] Xuan Gong commented on YARN-3152: - In the yarn-site.xml, we do set some value for the yarn.resourcemanager.nodes.exclude-path. Since the file does not exist, we should throw out the exception. When RM starts to transit to active, it automatically calls all the refresh*s. It is by design if any of them fails, we should let RM fail. Missing hadoop exclude file fails RMs in HA --- Key: YARN-3152 URL: https://issues.apache.org/jira/browse/YARN-3152 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Environment: Debian 7 Reporter: Neill Lima Assignee: Naganarasimha G R NI have two NNs in HA, they do not fail when the exclude file is not present (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in HA. I didn't create the exclude file at this point as well. I applied the HA RM settings properly and when I started both RMs I started getting this exception: 2015-02-06 12:25:25,326 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=transitionToActiveTARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=All users are allowed 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file or directory) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) ... 5 more 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 0x44af32566180094 closed 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 sessionTimeout=1 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate using SASL (unknown error) 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to x.x.x.x/x.x.x.x:2181, initiating session The issue is descriptive enough to resolve the problem - and it has been fixed by creating the exclude file. I just think as of a improvement: - Should RMs ignore the missing file as the NNs did? - Should single RM fail even when the file is not present? Just suggesting this improvement to keep the behavior consistent when working with in HA (both NNs and RMs). -- This message was sent by Atlassian JIRA (v6.3.4#6332)