[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-07-15 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627788#comment-14627788
 ] 

Naganarasimha G R commented on YARN-3152:
-

Thanks [~neillfontes] for responding. I was waiting for some feedback on this 
from you and the committers/PMC so that I can go ahead. Services will not be 
available, that's for sure, but another issue is visible from the logs: even 
though RM startup has failed, the other RM tries to come up, and if it fails 
for the same reason, it just keeps oscillating between the two. As you 
mentioned, I feel WARN should be sufficient for this issue, or we can adopt the 
approach specified by [~kasha] in YARN-3607 and change the behavior to log WARN 
by default and, if {{yarn.fail-fast}} is set to true, keep the current 
behavior. Thoughts?
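
To make the proposal concrete, here is a minimal sketch of the "WARN by default, fail only when {{yarn.fail-fast}} is true" idea. It is not Hadoop code: the class {{ExcludeFileLoader}}, the method {{loadExcludedHosts}} and the {{failFast}} parameter (standing in for the configured value of {{yarn.fail-fast}}) are hypothetical names used only for illustration, and the file format is assumed to be one host per line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

/** Hypothetical helper, not part of Hadoop: reads an exclude file leniently or strictly. */
final class ExcludeFileLoader {
  private static final Logger LOG =
      Logger.getLogger(ExcludeFileLoader.class.getName());

  private ExcludeFileLoader() {}

  /**
   * Returns the hosts listed in excludeFile (assumed format: one host per line).
   * failFast stands in for the value of yarn.fail-fast.
   */
  static Set<String> loadExcludedHosts(Path excludeFile, boolean failFast) {
    try {
      Set<String> hosts = new HashSet<>();
      for (String line : Files.readAllLines(excludeFile)) {
        String host = line.trim();
        if (!host.isEmpty()) {
          hosts.add(host);
        }
      }
      return hosts;
    } catch (IOException e) {
      if (failFast) {
        // Strict mode: keep the current behavior and let the failure propagate.
        throw new UncheckedIOException(
            "Cannot read exclude file " + excludeFile, e);
      }
      // Lenient mode: log a warning and continue with no excluded hosts.
      LOG.warning("Exclude file " + excludeFile
          + " is not readable; excluding no hosts: " + e);
      return Collections.emptySet();
    }
  }
}
```

With {{failFast}} set to false, a missing file yields an empty set and a warning in the log; with true, the caller sees an exception, as it does today on the transition to active.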

 Missing hadoop exclude file fails RMs in HA
 ---

 Key: YARN-3152
 URL: https://issues.apache.org/jira/browse/YARN-3152
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
 Environment: Debian 7
Reporter: Neill Lima
Assignee: Naganarasimha G R

 I have two NNs in HA; they do not fail when the exclude file is not present 
 (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and I wanted to make two in 
 HA. I didn't create the exclude file at this point either. I applied the HA 
 RM settings properly and when I started both RMs I started getting this 
 exception:
 2015-02-06 12:25:25,326 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root   
 OPERATION=transitionToActive   TARGET=RMHAProtocolService  
 RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   
 PERMISSIONS=All users are allowed
 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
   at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
   at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
   at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
   ... 4 more
 Caused by: org.apache.hadoop.ha.ServiceFailedException: 
 java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file 
 or directory)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
   ... 5 more
 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
 Trying to re-establish ZK session
 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 
 0x44af32566180094 closed
 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating 
 client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 
 sessionTimeout=1 
 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
 connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate 
 using SASL (unknown error)
 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket 
 connection established to x.x.x.x/x.x.x.x:2181, initiating session
 The exception is descriptive enough to resolve the problem, and it has been 
 fixed by creating the exclude file. 
 I just see room for an improvement: 
 - Should RMs ignore the missing file as the NNs do?
 - Should a single RM also fail when the file is not present?
 Just suggesting this improvement to keep the behavior consistent when working 
 in HA (both NNs and RMs). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-07-14 Thread Neill Lima (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626448#comment-14626448
 ] 

Neill Lima commented on YARN-3152:
--

Hello [~Naganarasimha], I am revisiting this topic since it has been a while; 
sorry about the delay.

| When you mention RMs didn't start you mean you were not able to access web 
ui or the process down ?

The RM didn't bootstrap, so I couldn't even see the web UI or connect to the 
server. If the exclude file is not that relevant, its absence should not 
block the RM from coming up (that is a much higher priority). A [WARN] could 
be added to the logs though, just like the NNs do.



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-11 Thread Neill Lima (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316294#comment-14316294
 ] 

Neill Lima commented on YARN-3152:
--

[~vinodkv] -- It fails in both RMs indeed. What was 'unexpected' is that it 
didn't fail with a single RM despite the missing exclude file. 

Is the exclude file so relevant to the RMs but not so much to the NNs? The 
behavior (NNs vs RMs) is very different. I lean towards the NNs' behavior. 



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-11 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316887#comment-14316887
 ] 

Naganarasimha G R commented on YARN-3152:
-

[~vinodkv], [~xgong], [~neillfontes] and [~rohithsharma],
From the discussions so far, what I can conclude is: as per the design, if the 
required files are not there we need to fail fast, i.e. in a non-HA cluster we 
should throw an exception and the RM should fail to start. In the HA case, the 
transition to active should fail and none of the services should be active on 
failure. We need to achieve this as part of this JIRA. Please let me know 
whether this approach is fine or needs more discussion.

[~neillfontes],
I hope you got to test with the steps I mentioned in my earlier 
[comment|https://issues.apache.org/jira/browse/YARN-3152?focusedCommentId=14313875&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14313875].
It seems you were able to see the same behavior I mentioned in step 2, but I 
wanted to know more about step 3, where I see the active services oscillating 
between the 2 RM servers. Is it the same behavior I mentioned, or am I missing 
something?



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312711#comment-14312711
 ] 

Vinod Kumar Vavilapalli commented on YARN-3152:
---

Per [~xgong]'s comment 
[above|https://issues.apache.org/jira/browse/YARN-3152?focusedCommentId=14310921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14310921],
it seems that the RM fails if the configured exclude file is absent in both HA 
and non-HA cases, at startup as well as during failover. If that is correct, 
please fix the title of the JIRA.

bq.  Assume there is 2 RM HA cluster and active RM is made down (manually or 
for any other reason) and standby RM is not having the configured exclude file 
[..]
This is a general problem with keeping two HA nodes in sync w.r.t. 
configuration. YARN-1666 and friends address this issue via a different 
solution that may not have been well documented. With that, you'd not even put 
the exclude files on the local host of each of the RMs. Will that work for you?



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313309#comment-14313309
 ] 

Rohith commented on YARN-3152:
--

Not related to this JIRA but similar: I see a potential issue in 
{{AdminService.transitionToActive()}}. If there is an exception in 
{{rm.transitionToActive()}}, the RM stays in standby. But if the exception 
comes from {{refreshAll()}}, for example because the scheduler configuration 
has a problem, the RM will be in active state while the elector keeps trying to 
make it active in a single-node cluster with HA. I think the RM should 
explicitly call transitionToStandby in that case, so that the admin or user is 
prompted to look into the logs and identify the configuration problem.
{code}
try {
  rm.transitionToActive();
  // call all refresh*s for active RM to get the updated configurations.
  refreshAll();
  RMAuditLogger.logSuccess(user.getShortUserName(),
      "transitionToActive", "RMHAProtocolService");
} catch (Exception e) {
  // the failure is audit-logged and rethrown as a ServiceFailedException
  // (handler body omitted from this excerpt)
}
{code}
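
The suggestion above can be sketched roughly as follows. This is a simplified, hypothetical model rather than the actual {{AdminService}}: the class {{HaTransitionSketch}}, its {{State}} enum and the {{Refresher}} interface are assumptions made for illustration. It shows only the idea of falling back to standby when the refresh step fails after the RM has already become active.

```java
/** Hypothetical stand-in for the RM's HA state handling; not Hadoop code. */
final class HaTransitionSketch {
  enum State { STANDBY, ACTIVE }

  /** Hypothetical hook for the refresh* calls made after becoming active. */
  interface Refresher {
    void refreshAll() throws Exception;
  }

  private State state = State.STANDBY;

  State getState() {
    return state;
  }

  /**
   * Becomes active, then runs the refresh step. If the refresh fails, the
   * state is moved back to standby so that it is not left half-active.
   * Returns true only when both steps succeeded.
   */
  boolean transitionToActive(Refresher refresher) {
    state = State.ACTIVE;
    try {
      refresher.refreshAll();
      return true;
    } catch (Exception e) {
      // Explicit fallback: a bad configuration should not leave us active.
      state = State.STANDBY;
      return false;
    }
  }
}
```

A caller could pass a lambda that throws {{java.io.FileNotFoundException}} to imitate the missing exclude file; in this sketch the state then ends up as STANDBY rather than ACTIVE.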



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-09 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312911#comment-14312911
 ] 

Xuan Gong commented on YARN-3152:
-

bq. Per Xuan Gong's comment above, it seems that the RM fails if the configured 
exclude file is absent in both HA and non HA cases, at startup as well as doing 
failover. If that is correct, please fix the title of the JIRA.

At startup, in both HA and non-HA, the RM does not fail if the configured 
exclude file is absent, because we explicitly catch the exception. 
{code}
try {
  this.includesFile = conf.get(YarnConfiguration.RM_NODES_INCLUDE_FILE_PATH,
  YarnConfiguration.DEFAULT_RM_NODES_INCLUDE_FILE_PATH);
  this.excludesFile = conf.get(YarnConfiguration.RM_NODES_EXCLUDE_FILE_PATH,
  YarnConfiguration.DEFAULT_RM_NODES_EXCLUDE_FILE_PATH);
  this.hostsReader =
  createHostsFileReader(this.includesFile, this.excludesFile);
  setDecomissionedNMsMetrics();
  printConfiguredHosts();
} catch (YarnException ex) {
  disableHostsFileReader(ex);
} catch (IOException ioe) {
  disableHostsFileReader(ioe);
}
{code}

It only fails in the failover case. After we start all the related services, 
we automatically call refresh* (refresh queue, nodelist, userToGroupMapping, 
etc.). At this time, if the configured exclude file is absent, the 
refreshNodeList call will throw an exception, which causes the RM to fail.
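
The asymmetry described above can be illustrated with a small, self-contained model. It is a sketch only, not the actual RM code: the class {{HostsFilePathsSketch}} and its methods are hypothetical, and reading the file with {{Files.readAllLines}} is a stand-in for the real hosts file reader.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the two code paths; not the actual RM classes. */
final class HostsFilePathsSketch {
  private List<String> excludedHosts = new ArrayList<>();
  private boolean hostsReaderEnabled = true;

  boolean isHostsReaderEnabled() {
    return hostsReaderEnabled;
  }

  List<String> getExcludedHosts() {
    return excludedHosts;
  }

  /** Startup path: an unreadable file disables the reader instead of failing. */
  void init(Path excludeFile) {
    try {
      excludedHosts = Files.readAllLines(excludeFile);
    } catch (IOException ioe) {
      hostsReaderEnabled = false;
      excludedHosts = new ArrayList<>();
    }
  }

  /** Refresh path: the same problem propagates to the caller. */
  void refresh(Path excludeFile) throws IOException {
    excludedHosts = Files.readAllLines(excludeFile);
    hostsReaderEnabled = true;
  }
}
```

Calling {{init}} with a missing path leaves the object usable with the reader disabled, while calling {{refresh}} with the same path throws. That mirrors why startup survives the missing file but the refresh during the transition to active does not.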


[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-09 Thread Neill Lima (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312932#comment-14312932
 ] 

Neill Lima commented on YARN-3152:
--

[~xgong] got it correctly, although my case was slightly different:

I had a single RM and wanted to move to a dual HA RM setup. I did not have the 
exclude file, and the single RM had not failed because of it. When I 
transitioned to HA mode, the RMs didn't start because of the missing file. 
That's why I called the behavior inconsistent. The NNs handled it more 
gracefully: they don't fail, they just log a [WARN]. 

 Missing hadoop exclude file fails RMs in HA
 ---

 Key: YARN-3152
 URL: https://issues.apache.org/jira/browse/YARN-3152
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
 Environment: Debian 7
Reporter: Neill Lima
Assignee: Naganarasimha G R

 I have two NNs in HA; they do not fail when the exclude file is not present 
 (hadoop-2.6.0/etc/hadoop/exclude). I had one RM and wanted to run two in 
 HA. I had not created the exclude file at that point either. I applied the HA 
 RM settings properly, and when I started both RMs I started getting this 
 exception:
 2015-02-06 12:25:25,326 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root   
 OPERATION=transitionToActive  TARGET=RMHAProtocolService  
 RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   
 PERMISSIONS=All users are allowed
 2015-02-06 12:25:25,326 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
   at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805)
   at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
   at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:304)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
   ... 4 more
 Caused by: org.apache.hadoop.ha.ServiceFailedException: 
 java.io.FileNotFoundException: /hadoop-2.6.0/etc/hadoop/exclude (No such file 
 or directory)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:626)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
   ... 5 more
 2015-02-06 12:25:25,327 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
 Trying to re-establish ZK session
 2015-02-06 12:25:25,339 INFO org.apache.zookeeper.ZooKeeper: Session: 
 0x44af32566180094 closed
 2015-02-06 12:25:26,340 INFO org.apache.zookeeper.ZooKeeper: Initiating 
 client connection, connectString=x.x.x.x:2181,x.x.x.x:2181 
 sessionTimeout=1 
 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@307587c
 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
 connection to server x.x.x.x/x.x.x.x:2181. Will not attempt to authenticate 
 using SASL (unknown error)
 2015-02-06 12:25:26,341 INFO org.apache.zookeeper.ClientCnxn: Socket 
 connection established to x.x.x.x/x.x.x.x:2181, initiating session
 The error is descriptive enough to resolve the problem, and it has been 
 fixed by creating the exclude file. 
 I just suggest the following as an improvement: 
 - Should RMs ignore the missing file as the NNs did?
 - Should a single RM fail even when the file is not present?
 Just suggesting this improvement to keep the behavior consistent when working 
 in HA (both NNs and RMs). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313081#comment-14313081
 ] 

Vinod Kumar Vavilapalli commented on YARN-3152:
---

bq. It only fails in fail-over case.
Okay, we should try to be consistent then. Fail it in both places assuming that 
YARN-1666 and friends address the general problem with two HA nodes being in 
sync w.r.t configuration?



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311817#comment-14311817
 ] 

Rohith commented on YARN-3152:
--

I have some doubts about the behaviour; please help me understand it.
bq. When we call any refresh, if there is any problem, I think that we need to 
throw out the exception instead of silence ignore it. 
That makes sense for admin operations, i.e. the admin explicitly invoking 
refresh commands. But why do we ignore the problem during RM startup and 
consider it only during transitionToActive?



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-08 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311771#comment-14311771
 ] 

Xuan Gong commented on YARN-3152:
-

Thanks for the comment, [~Naganarasimha]. This is a very good suggestion.

bq. currently there is different behavior in HA mode(starts of properly) and 
non HA mode

So, in non-HA mode, when the RM starts, it just starts all the services.
But in HA mode, when the RM transitions to Active, it is a two-step process:
* start all the services, which is the same as in non-HA mode and works fine.
* call refreshAll() to refresh all the configuration, including refreshQueues, 
refreshUserToGroupInformation, refreshNodes, etc. At this point, if we 
configure the file path but the file does not exist, it will throw the 
exception.

When we call any refresh, if there is any problem, I think that we need to 
throw the exception instead of silently ignoring it. 
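
To make the two-step flow above concrete, here is a minimal sketch. It is not the actual AdminService code: the class TransitionSketch, the RefreshStep interface, and the log list are hypothetical names used only for illustration. It assumes that each refresh is a step that may throw, and that the first failing step aborts the transition.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the HA transition; all names here are hypothetical. */
public class TransitionSketch {

    /** Stand-in for one refresh call such as refreshQueues or refreshNodes. */
    public interface RefreshStep {
        void run() throws Exception;
    }

    /** Records what happened, in order, so the flow can be inspected. */
    public final List<String> log = new ArrayList<>();

    /** Step 1: same as non-HA startup. */
    void startServices() {
        log.add("services started");
    }

    /** Step 2: run every refresh; the first exception propagates to the caller. */
    void refreshAll(List<RefreshStep> steps) throws Exception {
        for (RefreshStep step : steps) {
            step.run();
        }
        log.add("refresh done");
    }

    /** Returns true if the RM became active, false if a refresh failed. */
    public boolean transitionToActive(List<RefreshStep> steps) {
        startServices();
        try {
            refreshAll(steps);
            return true;
        } catch (Exception e) {
            log.add("transition failed: " + e.getMessage());
            return false;
        }
    }
}
```

With a step that throws java.io.FileNotFoundException for the exclude file, transitionToActive returns false even though the services started fine, which mirrors the difference from non-HA startup described above.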



[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-07 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310921#comment-14310921
 ] 

Xuan Gong commented on YARN-3152:
-

bq. NodesListManager.refreshNodes should be same as 
NodesListManager.serviceInit where in they handle gracefully if the configured 
paths for exclude/include doesn't exist.

NodesListManager.refreshNodes works the same way as 
NodesListManager.serviceInit.

{code}
// NodesListManager.serviceInit:
HostsFileReader hostsReader =
    new HostsFileReader(includesFile,
        (includesFile == null || includesFile.isEmpty()) ? null
            : this.rmContext.getConfigurationProvider()
                .getConfigurationInputStream(this.conf, includesFile),
        excludesFile,
        (excludesFile == null || excludesFile.isEmpty()) ? null
            : this.rmContext.getConfigurationProvider()
                .getConfigurationInputStream(this.conf, excludesFile));
{code}

If the file does not exist, both of them will throw out the exception. No ?

I understand your concern. But I think the earlier we find the issue, the 
better (in our case, it may be hard to debug why the exclude nodes are not 
considered even though we provide the exclude-node list). So, we throw out 
such exception when active RM starts.


[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-07 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311058#comment-14311058
 ] 

Naganarasimha G R commented on YARN-3152:
-

Hi [~xgong],
bq. If the file does not exist, both of them will throw out the exception. No ?
Yes, you are right. The intention is to not throw the exception, but maybe log 
a WARN message saying the configured file doesn't exist (with the path info). 

bq. So, we throw out such exception when active RM starts
Currently I see different behavior in HA mode (starts up properly) and 
non-HA mode (starts up but continuously logs that the file is not found), 
which I feel should not be the case.
Also, in other places where we configure a file path, we do check whether the 
file exists (e.g. nodeHealthScripts), so I was of the opinion that it is better 
to add a file-exists check here too, or at least the behavior for HA and non-HA 
mode should be the same.
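
A rough sketch of the kind of check suggested above, not a patch against NodesListManager: HostsFileCheck and shouldReadHostsFile are hypothetical names, and java.util.logging stands in for whatever logger the RM actually uses. It assumes a missing file should produce a WARN with the path and then be skipped rather than raise an exception.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.logging.Logger;

/** Hypothetical helper illustrating a file-exists check with a WARN. */
public class HostsFileCheck {
    private static final Logger LOG =
        Logger.getLogger(HostsFileCheck.class.getName());

    /**
     * Returns true when the configured include/exclude file should be read.
     * An unset path returns false silently; a configured but missing file
     * returns false and logs a warning that includes the path.
     */
    public static boolean shouldReadHostsFile(String configuredPath) {
        if (configuredPath == null || configuredPath.isEmpty()) {
            return false; // nothing configured, nothing to read
        }
        Path path = Paths.get(configuredPath);
        if (!Files.exists(path)) {
            LOG.warning("Configured file doesn't exist: " + path.toAbsolutePath());
            return false;
        }
        return true;
    }
}
```

A caller would run this check in both the startup path and the refresh path, so HA and non-HA mode treat a missing file the same way.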





[jira] [Commented] (YARN-3152) Missing hadoop exclude file fails RMs in HA

2015-02-06 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310487#comment-14310487
 ] 

Xuan Gong commented on YARN-3152:
-

In yarn-site.xml, we do set a value for 
yarn.resourcemanager.nodes.exclude-path. Since the file does not exist, we 
should throw the exception. When the RM starts to transition to active, it 
automatically calls all the refresh*s. It is by design that if any of them 
fails, we let the RM fail.
