[ https://issues.apache.org/jira/browse/YARN-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zack Marsh resolved YARN-3871.
------------------------------
    Resolution: Invalid

> ResourceManager down after Blueprint install
> ---------------------------------------------
>
>                 Key: YARN-3871
>                 URL: https://issues.apache.org/jira/browse/YARN-3871
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>         Environment: ambari-2.1.0-1295, hdp-2.3.0.0-2497, sles11sp3
>            Reporter: Zack Marsh
>         Attachments: yarn-yarn-resourcemanager-piripiri3.log, yarn-yarn-resourcemanager-piripiri3.out
>
>
> On a 3-Master HDP 2.3 cluster installed with HDP-2.3.0.0-2482 and Ambari-2.1.0-1266, the YARN ResourceManager was down following the Blueprint install.
> Nothing failed during the Blueprint install itself. The ResourceManager shut down because it could not connect to ZooKeeper.
> Excerpt from the ResourceManager log:
> {code}
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.library.path=:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/tmp
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.compiler=<NA>
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.name=Linux
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.arch=amd64
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.version=3.0.101-0.50.TDC.1.R.0-default
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.name=yarn
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.home=/home/yarn
> 2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.3.0.0-2482/hadoop-yarn
> 2015-06-26 03:35:47,190 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=piripiri2.labs.teradata.com:2181,piripiri1.labs.teradata.com:2181,piripiri3.labs.teradata.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@59d2103b
> 2015-06-26 03:35:47,209 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:47,276 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:47,380 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:47,381 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
> 2015-06-26 03:35:47,452 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
> 2015-06-26 03:35:48,067 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:48,378 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:49,914 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:49,915 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:50,028 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:50,028 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
> 2015-06-26 03:35:50,030 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
> 2015-06-26 03:35:50,133 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:50,134 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:52,064 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:52,065 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:52,901 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:52,901 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
> 2015-06-26 03:35:52,902 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
> 2015-06-26 03:35:53,570 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:53,571 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:55,541 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:55,542 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:56,513 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:56,514 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
> 2015-06-26 03:35:56,515 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
> 2015-06-26 03:35:56,821 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
> 2015-06-26 03:35:56,822 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2015-06-26 03:35:57,205 ERROR ha.ActiveStandbyElector (ActiveStandbyElector.java:waitForZKConnectionEvent(1044)) - Connection timed out: couldn't connect to ZooKeeper in 10000 milliseconds
> 2015-06-26 03:35:57,396 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x0 closed
> 2015-06-26 03:35:57,397 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> 2015-06-26 03:35:57,403 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService failed in state INITED; cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
>     at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
>     at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
>     at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
> 2015-06-26 03:35:57,404 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
>     at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
>     at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
>     at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 7 more
> 2015-06-26 03:35:57,404 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
>     at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
>     at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
>     at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 7 more
> 2015-06-26 03:35:57,405 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1068)) - Transitioning to standby state
> 2015-06-26 03:35:57,405 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1075)) - Transitioned to standby state
> 2015-06-26 03:35:57,405 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1230)) - Error starting ResourceManager
> org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
>     at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
>     at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
>     at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
>     at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     ... 7 more
> 2015-06-26 03:35:57,407 INFO resourcemanager.ResourceManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down ResourceManager at piripiri3/39.0.40.3
> ************************************************************/
> {code}
> This issue was observed again on a 3-Master cluster installed with HDP-2.3.0.0-2497 and Ambari-2.1.0-1295.
> YARN logs attached.
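
The log shows the ResourceManager cycling through the three servers in its connectString and giving up once the 10000 ms session timeout expires. One way to check whether the ZooKeeper ensemble is reachable before the ResourceManager starts is a small pre-flight probe along the following lines. This is only a sketch, not part of Ambari or YARN: the function names `parse_connect_string` and `probe` are hypothetical, and it assumes a plain TCP connect is an acceptable first check.

```python
import socket


def parse_connect_string(connect_string, default_port=2181):
    """Split a ZooKeeper connect string into (host, port) pairs.

    Hypothetical helper: handles "host:port,host:port[/chroot]" only,
    not IPv6 literals.
    """
    servers = []
    # An optional chroot suffix such as "/yarn" follows the host list.
    hosts_part = connect_string.split("/", 1)[0]
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            servers.append((host, int(port)))
        else:
            # No explicit port given, fall back to the default client port.
            servers.append((entry, default_port))
    return servers


def probe(host, port, timeout=2.0):
    """Return "open" if a TCP connection succeeds, else the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        # Covers "Connection refused", DNS failures, and timeouts.
        return f"failed: {exc}"
```

Calling `probe(host, port)` for each pair returned by `parse_connect_string(...)` on the connectString from the log would report which servers refuse connections outright. An "open" result is not sufficient on its own: in the excerpt above, piripiri3 accepted the socket and then closed it before a session was established, so a successful TCP connect does not prove the server is ready to serve clients.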