Re: zookeeper for namenode doesn't elect active, when host is down

[email protected] Fri, 22 May 2015 04:17:57 -0700

Thanks for the reply, Chris. Now I understand. Look, I have 2 NameNodes (the maximum in an HA cluster), with a ZKFC started on each. So when the host with one NameNode halts, ONLY one ZKFC is running, and it cannot elect a leader. When I try to run a ZKFC on the DataNodes or ResMan, I get an error:

Exception in thread "main" org.apache.hadoop.HadoopIllegalArgumentException: Could not get the namenode ID of this node. You may run zkfc on the node other than namenode.
         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.create(DFSZKFailoverController.java:128)
         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:177)

 (-u) 999990
virtual memory          (kbytes, -v) unlimited



How can I start a ZKFC on a node other than a NameNode?

21.05.2015 20:28, Chris Nauroth wrote:

Hello,

The HA implementations for NameNode and ResourceManager are slightly
different.  For the NameNode, there is a separate process called the
ZKFailoverController that owns the ZooKeeper session.  When that process
sees that it has obtained a lock through ZooKeeper, then it sends a
command to the NameNode on the same host to transition to active state.
For the ResourceManager, there is no separate failover controller process.
Instead, the ResourceManager process directly runs the ZooKeeper client,
owns the ZooKeeper session, and handles its own failover semantics.
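
The difference described above can be illustrated with a toy, in-memory model. This is only a sketch: FakeZooKeeper, NameNode and FailoverController are hypothetical names invented for the illustration, not Hadoop or ZooKeeper classes, and the real ZKFailoverController also performs health monitoring and fencing, which are left out here. It shows one point only: a standby NameNode can become active only if the failover controller on its own host is running and obtains the lock.

```python
class FakeZooKeeper:
    """In-memory stand-in for one ephemeral lock znode (not a real client)."""

    def __init__(self):
        self.lock_owner = None

    def try_acquire(self, session):
        # The lock goes to the first session that asks while it is free.
        if self.lock_owner is None:
            self.lock_owner = session
        return self.lock_owner == session

    def expire(self, session):
        # An ephemeral lock disappears when its owner's session ends.
        if self.lock_owner == session:
            self.lock_owner = None


class NameNode:
    def __init__(self, host):
        self.host = host
        self.state = "standby"


class FailoverController:
    """Separate process that owns the session for the NameNode on its host."""

    def __init__(self, namenode, zk):
        self.namenode = namenode
        self.zk = zk
        self.session = "zkfc@" + namenode.host

    def run_election(self):
        # Holding the lock means telling the local NameNode to go active.
        if self.zk.try_acquire(self.session):
            self.namenode.state = "active"
        else:
            self.namenode.state = "standby"
        return self.namenode.state


zk = FakeZooKeeper()
nn1, nn2 = NameNode("name-node1"), NameNode("name-node2")
zkfc1, zkfc2 = FailoverController(nn1, zk), FailoverController(nn2, zk)

zkfc1.run_election()        # name-node1 takes the lock
zkfc2.run_election()        # name-node2 stays standby

zk.expire(zkfc1.session)    # host name-node1 halts, its session ends
print(nn2.state)            # still standby: nothing has told it to change
zkfc2.run_election()        # only a running controller on name-node2 can do that
print(nn2.state)
```

In this model, if the controller next to name-node2 is not running, nobody ever calls run_election for it, and name-node2 stays in standby even though the lock is free.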

The symptoms that you described make it sound like perhaps one of the
ZKFailoverController processes is not running or is malfunctioning.  I
recommend starting the investigation there.  Full documentation of this
architecture and its configuration is available here:

http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html


This is more of an HDFS question than a ZooKeeper question, so for any
follow-up discussion, I recommend restarting the thread on
[email protected].

I hope this helps!

--Chris Nauroth




On 5/21/15, 6:30 AM, "[email protected]" <[email protected]> wrote:

Hello.
I have set up a Hadoop HA cluster with automatic failover for the NameNodes
and the ResourceManager, following these manuals:

http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html#16
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html


So, when I halt only a Hadoop daemon, ZooKeeper switches both the active
NameNode and ResMan. But when I halt a whole server (one that is also a
member of the ZooKeeper quorum), only ResMan switches.
I have tried many configurations.

Here is zoo.cfg:

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/var/zookeeper/data
clientPort=2181
cnxTimeout=3

server.1=name-node1:2888:3888
server.2=name-node2:2888:3888
server.3=resource-manager:2888:3888
server.4=resource-manager2:2888:3888
server.5=data-node1:2888:3888
server.6=data-node2:2888:3888

group.1=1:2:5
group.2=3:4:6
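
The group.N lines above enable ZooKeeper's hierarchical quorums. As a rough sketch of that rule as I understand it (a quorum needs a majority of the weight in a majority of the groups), the check below models which host failures this ensemble should tolerate. The function name contains_quorum and the data layout are made up for this illustration, and the default weight of 1 per server is an assumption, since zoo.cfg sets no weight.N lines.

```python
def contains_quorum(alive, groups, weights=None):
    """Return True if the live servers form a hierarchical quorum.

    alive   -- set of server ids that are currently up
    groups  -- dict mapping group id to the set of server ids in it
    weights -- optional dict of server id to weight (assumed 1 if absent)
    """
    weights = weights or {}
    counted_groups = 0
    majority_groups = 0
    for members in groups.values():
        total = sum(weights.get(sid, 1) for sid in members)
        if total == 0:
            continue  # groups with no weight do not take part
        counted_groups += 1
        live = sum(weights.get(sid, 1) for sid in members if sid in alive)
        if 2 * live > total:  # strict majority of the group's weight
            majority_groups += 1
    return 2 * majority_groups > counted_groups  # strict majority of groups


# Groups as written in zoo.cfg: group.1=1:2:5 and group.2=3:4:6
GROUPS = {1: {1, 2, 5}, 2: {3, 4, 6}}
ALL = {1, 2, 3, 4, 5, 6}

print(contains_quorum(ALL - {1}, GROUPS))     # name-node1 halted -> True
print(contains_quorum(ALL - {3}, GROUPS))     # resource-manager halted -> True
print(contains_quorum(ALL - {1, 2}, GROUPS))  # both NameNode hosts halted -> False
```

With only two groups, a majority of groups means both of them, so each group has to keep two of its three servers. If this model is right, halting name-node1 alone still leaves the ensemble with a quorum.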

core-site.xml

   <property>
     <name>ha.zookeeper.quorum</name>
     <value>name-node1:2181,name-node2:2181,data-node1:2181</value>
   </property>

yarn-site.xml

   <property>
     <name>yarn.resourcemanager.zk-address</name>

     <value>resource-manager:2181,resource-manager2:2181,data-node2:2181</value>
   </property>

When I halted the whole name-node1 host, I saw the following in the ZooKeeper log:

2015-05-21 13:24:22,177 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken for id 3, my id = 5, error =
java.io.EOFException
         at java.io.DataInputStream.readInt(DataInputStream.java:392)
         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
2015-05-21 13:24:22,178 [myid:5] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@697] - Interrupted while waiting for message on queue
java.lang.InterruptedException
         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
2015-05-21 13:24:22,179 [myid:5] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@706] - Send worker leaving thread

When I halted the whole resource-manager host, I saw the following in the ZooKeeper log:


2015-05-21 13:24:22,990 [myid:4] - INFO  [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x34d767b51ef0000 type:create cxid:0x9 zxid:0x1c0000004e txntype:-1 reqpath:n/a Error Path:/yarn-leader-election/dph-rm/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /yarn-leader-election/dph-rm/ActiveStandbyElectorLock

After this, ResMan2 became active.

What am I doing wrong?
Thanks.
