Zack Marsh created YARN-3871:
--------------------------------
Summary: ResourceManager down after Blueprint install
Key: YARN-3871
URL: https://issues.apache.org/jira/browse/YARN-3871
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.7.1
Environment: ambari-2.1.0-1295, hdp-2.3.0.0-2497, sles11sp3
Reporter: Zack Marsh
Priority: Critical
Attachments: yarn-yarn-resourcemanager-piripiri3.log,
yarn-yarn-resourcemanager-piripiri3.out
On a 3-Master HDP 2.3 cluster installed with HDP-2.3.0.0-2482 and
Ambari-2.1.0-1266, the YARN ResourceManager was down following the Blueprint
install.
Note that nothing failed during the Blueprint install. The ResourceManager
shut down because it could not connect to ZooKeeper.
Excerpt from the ResourceManager log:
{code}
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client
environment:java.library.path=:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/tmp
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:java.compiler=<NA>
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:os.name=Linux
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:os.arch=amd64
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client
environment:os.version=3.0.101-0.50.TDC.1.R.0-default
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:user.name=yarn
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client environment:user.home=/home/yarn
2015-06-26 03:35:47,188 INFO zookeeper.ZooKeeper
(Environment.java:logEnv(100)) - Client
environment:user.dir=/usr/hdp/2.3.0.0-2482/hadoop-yarn
2015-06-26 03:35:47,190 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438))
- Initiating client connection,
connectString=piripiri2.labs.teradata.com:2181,piripiri1.labs.teradata.com:2181,piripiri3.labs.teradata.com:2181
sessionTimeout=10000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@59d2103b
2015-06-26 03:35:47,209 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:47,276 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:47,380 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:47,381 INFO zookeeper.ClientCnxn
(ClientCnxn.java:primeConnection(852)) - Socket connection established to
piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:47,452 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098))
- Unable to read additional data from server sessionid 0x0, likely server has
closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:48,067 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:48,378 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:49,914 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:49,915 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:50,028 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:50,028 INFO zookeeper.ClientCnxn
(ClientCnxn.java:primeConnection(852)) - Socket connection established to
piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:50,030 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098))
- Unable to read additional data from server sessionid 0x0, likely server has
closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:50,133 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:50,134 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:52,064 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:52,065 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:52,901 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:52,901 INFO zookeeper.ClientCnxn
(ClientCnxn.java:primeConnection(852)) - Socket connection established to
piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:52,902 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098))
- Unable to read additional data from server sessionid 0x0, likely server has
closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:53,570 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:53,571 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:55,541 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:55,542 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:56,513 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:56,514 INFO zookeeper.ClientCnxn
(ClientCnxn.java:primeConnection(852)) - Socket connection established to
piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:56,515 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1098))
- Unable to read additional data from server sessionid 0x0, likely server has
closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:56,821 INFO zookeeper.ClientCnxn
(ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server
piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate
using SASL (unknown error)
2015-06-26 03:35:56,822 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1102))
- Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:57,205 ERROR ha.ActiveStandbyElector
(ActiveStandbyElector.java:waitForZKConnectionEvent(1044)) - Connection timed
out: couldn't connect to ZooKeeper in 10000 milliseconds
2015-06-26 03:35:57,396 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) -
Session: 0x0 closed
2015-06-26 03:35:57,397 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) -
EventThread shut down
2015-06-26 03:35:57,403 INFO service.AbstractService
(AbstractService.java:noteFailure(272)) - Service
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService failed in
state INITED; cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
at
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
at
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at
org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
2015-06-26 03:35:57,404 INFO service.AbstractService
(AbstractService.java:noteFailure(272)) - Service
org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state
INITED; cause: org.apache.hadoop.service.ServiceStateException:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
org.apache.hadoop.service.ServiceStateException:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
at
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
at
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at
org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
2015-06-26 03:35:57,404 INFO service.AbstractService
(AbstractService.java:noteFailure(272)) - Service ResourceManager failed in
state INITED; cause: org.apache.hadoop.service.ServiceStateException:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
org.apache.hadoop.service.ServiceStateException:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
at
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
at
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at
org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
2015-06-26 03:35:57,405 INFO resourcemanager.ResourceManager
(ResourceManager.java:transitionToStandby(1068)) - Transitioning to standby
state
2015-06-26 03:35:57,405 INFO resourcemanager.ResourceManager
(ResourceManager.java:transitionToStandby(1075)) - Transitioned to standby state
2015-06-26 03:35:57,405 FATAL resourcemanager.ResourceManager
(ResourceManager.java:main(1230)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =
ConnectionLoss
at
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
at
org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
at
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
at
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
at
org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
2015-06-26 03:35:57,407 INFO resourcemanager.ResourceManager
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at piripiri3/39.0.40.3
************************************************************/
{code}
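In the excerpt above, piripiri1 and piripiri2 refuse the connection, while piripiri3 accepts the socket and then closes it before a session is established. To make that pattern easier to see in the full attached log, the connection attempts can be tallied per server. The following is only a rough sketch: the function name `summarize_attempts` and the outcome labels are made up for illustration, and it assumes the ZooKeeper client tries one server at a time, as it does in this log.

```python
import re
from collections import Counter

# Matches either the start of a connection attempt or one of the two
# failure messages seen in the ResourceManager log excerpt.
EVENT_RE = re.compile(
    r"Opening socket connection to server (\S+?)/"
    r"|(Connection refused)"
    r"|(Unable to read additional data)"
)


def summarize_attempts(log_text):
    """Count failed ZooKeeper connection attempts per (host, outcome)."""
    # The mailed log is hard-wrapped, so collapse all whitespace first.
    text = " ".join(log_text.split())
    counts = Counter()
    current_host = None
    for match in EVENT_RE.finditer(text):
        if match.group(1):
            # A new attempt begins; remember which server it targets.
            current_host = match.group(1)
        elif current_host is not None:
            outcome = "refused" if match.group(2) else "closed by server"
            counts[(current_host, outcome)] += 1
            current_host = None
    return counts
```

For example, `summarize_attempts(open("yarn-yarn-resourcemanager-piripiri3.log").read())` would return a `Counter` keyed by hostname and outcome, assuming the attachment has been saved locally under that name.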
This issue was observed again on a 3-Master cluster installed with
HDP-2.3.0.0-2497 and Ambari-2.1.0-1295.
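Since the failure comes down to whether the ZooKeeper servers in the connectString were reachable when the ResourceManager started, a quick reachability check may help anyone trying to reproduce this. The script below is a minimal sketch, not part of Hadoop or Ambari: the helper names `parse_connect_string` and `probe` are hypothetical, and it assumes the ZooKeeper "ruok" four-letter command is enabled on the servers (newer ZooKeeper releases require it to be whitelisted).

```python
import socket
import sys


def parse_connect_string(connect_string):
    """Split a ZooKeeper connect string into (host, port) pairs."""
    # An optional chroot path (e.g. "/yarn") may follow the server list.
    servers = connect_string.split("/", 1)[0]
    pairs = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # No explicit port given: fall back to the usual client port.
            host, port = entry, "2181"
        pairs.append((host, int(port)))
    return pairs


def probe(host, port, timeout=3.0):
    """Open a TCP connection, send "ruok", and describe what came back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = sock.recv(16)
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return "error: %s" % exc
    if reply == b"imok":
        return "ok"
    return "connected, unexpected reply %r" % reply


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the connectString from the log as the first argument.
    for zk_host, zk_port in parse_connect_string(sys.argv[1]):
        print("%s:%d %s" % (zk_host, zk_port, probe(zk_host, zk_port)))
```

Running it with the connectString from the log as its only argument prints one line per server. A plain TCP check alone would not distinguish the piripiri3 case, where the socket was accepted and then closed, which is why the sketch also waits for a reply.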