Hello.
I have set up a Hadoop HA cluster with automatic failover for the NameNodes
and the ResourceManager, following these guides:
http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html#16
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
When I stop only a Hadoop daemon, ZooKeeper-based failover promotes the
standby NameNode and ResourceManager to active. But when I halt a whole
server (one that is also a member of the ZooKeeper quorum), only the
ResourceManager fails over; the NameNode does not.
I have tried many configurations.
Here is zoo.cfg:
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/var/zookeeper/data
clientPort=2181
cnxTimeout=3
server.1=name-node1:2888:3888
server.2=name-node2:2888:3888
server.3=resource-manager:2888:3888
server.4=resource-manager2:2888:3888
server.5=data-node1:2888:3888
server.6=data-node2:2888:3888
group.1=1:2:5
group.2=3:4:6
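
As a sanity check on the group definitions above, here is a minimal Python sketch of the rule as I understand it from the ZooKeeper hierarchical quorum documentation: a quorum needs a majority of the weight in a majority of the groups. It assumes every server has the default weight of 1 (my zoo.cfg sets no weight.x lines). The names GROUPS and has_quorum are only for illustration and are not part of ZooKeeper.

```python
# Sketch of the hierarchical quorum rule; this is not ZooKeeper's own code.
# Assumption: every server has weight 1, because zoo.cfg sets no weight.x lines.

GROUPS = {
    1: {1, 2, 5},  # group.1=1:2:5
    2: {3, 4, 6},  # group.2=3:4:6
}


def has_quorum(alive, groups=GROUPS):
    """Return True if the live server ids hold a strict majority of the
    members in a strict majority of the groups."""
    live = set(alive)
    groups_with_majority = 0
    for members in groups.values():
        if 2 * len(members & live) > len(members):
            groups_with_majority += 1
    return 2 * groups_with_majority > len(groups)


if __name__ == "__main__":
    all_servers = {1, 2, 3, 4, 5, 6}
    print(has_quorum(all_servers - {1}))     # name-node1 halted
    print(has_quorum(all_servers - {3}))     # resource-manager halted
    print(has_quorum(all_servers - {1, 2}))  # both NameNode hosts halted
```

If this model is right, halting any single host should still leave a quorum, because with two groups both of them must keep two of their three members alive.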
core-site.xml:
<property>
<name>ha.zookeeper.quorum</name>
<value>name-node1:2181,name-node2:2181,data-node1:2181</value>
</property>
yarn-site.xml:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>resource-manager:2181,resource-manager2:2181,data-node2:2181</value>
</property>
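
To check which of the ZooKeeper servers listed in these two properties still answer after a host is halted, something like the following Python sketch could be used. It splits the quorum string into host and port pairs and sends ZooKeeper's "ruok" four-letter command, to which a healthy server replies "imok". The helper names parse_quorum and ruok are my own, and the sketch assumes four-letter commands are enabled on the servers (they are by default in the 3.4 line, as far as I know).

```python
import socket


def parse_quorum(value):
    """Split 'host1:2181,host2:2181' into [('host1', 2181), ('host2', 2181)]."""
    servers = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers


def ruok(host, port, timeout=3.0):
    """Send the 'ruok' four-letter command and report whether the server
    replied 'imok'. Any connection or lookup error counts as no answer."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        return False
```

For example, looping over parse_quorum("name-node1:2181,name-node2:2181,data-node1:2181") and calling ruok(host, port) for each pair shows which members of the ha.zookeeper.quorum list are still reachable from the machine where the script runs.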
When I halted the whole host name-node1, I saw the following in ZooKeeper's log:
2015-05-21 13:24:22,177 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken for id 3, my id = 5, error =
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
2015-05-21 13:24:22,178 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
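
To read the server ids in this excerpt as host names, here is a small Python sketch that maps the "id" and "my id" fields of the "Connection broken" message onto the server.N lines from my zoo.cfg. The function names server_hosts and broken_peer are made up for this example.

```python
import re

# server.N lines copied from zoo.cfg
ZOO_CFG = """\
server.1=name-node1:2888:3888
server.2=name-node2:2888:3888
server.3=resource-manager:2888:3888
server.4=resource-manager2:2888:3888
server.5=data-node1:2888:3888
server.6=data-node2:2888:3888
"""

# First message of the excerpt above
LOG = "Connection broken for id 3, my id = 5, error ="


def server_hosts(cfg_text):
    """Return {server id: host name} from the server.N lines of zoo.cfg."""
    hosts = {}
    for line in cfg_text.splitlines():
        match = re.match(r"server\.(\d+)=([^:]+):", line.strip())
        if match:
            hosts[int(match.group(1))] = match.group(2)
    return hosts


def broken_peer(log_text):
    """Return (peer id, own id) from a 'Connection broken' message, or None."""
    match = re.search(r"Connection broken for\s+id (\d+), my id = (\d+)", log_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    hosts = server_hosts(ZOO_CFG)
    peer, me = broken_peer(LOG)
    print("log written by:", hosts[me])
    print("lost connection to:", hosts[peer])
```

For the excerpt above, this reports data-node1 (myid 5) as the server that wrote the log and resource-manager (id 3) as the peer whose connection broke.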
When I halted the whole host resource-manager, I saw the following in ZooKeeper's log:
2015-05-21 13:24:22,990 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x34d767b51ef0000 type:create cxid:0x9 zxid:0x1c0000004e txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/dph-rm/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/dph-rm/ActiveStandbyElectorLock
After this, ResourceManager 2 became active.
What am I doing wrong?
Thanks.