Hello Tao, For the reason of the logs "Cannot open channel to ..." see my first reply. I believe they are not relevant for your problem, so let's forget about them for now.
You are saying in your mail that command "echo stat | nc 127.0.0.1 2181" reports "Mode: follower(/leader)" also in the *bad cluster*? That is very strange. If that is the case, then "bin/zkServer.sh status" should also work ok. Please confirm. If you found the line TOOK in the bad cluster, that means leader election was successful. Next thing that should be there (most likely a few lines below) is this (only in the leader): "Have quorum of supporters ..." Do you see something like this? "[main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181" (grep "binding to port") Could you also please check that you have a normal DNS resolution of localhost in the bad cluster? Regards, German. On Fri, Feb 14, 2014 at 10:55 AM, Tao Xiao <[email protected]> wrote: > Hi German, > > Thanks for your reply. > > I checked logs on all the three nodes for the two clusters(the *good > cluster* where I can query the status through "bin/zkServer.sh status" and > the* bad cluster* where executing the same command would report "Error > contacting service. It is probably not running.") respectively. > > > *The 1st and 2nd nods of both clusters have the following warnings in their > logs:* > .... .... > 2014-02-11 15:42:15,421 [myid:2] - WARN > [WorkerSender[myid=2]:QuorumCnxManager@368] - Cannot open channel to 3 at > election address imon-3/172.16.38.145:3888 > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) > at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) > at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) > at java.net.Socket.connect(Socket.java:529) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327) > at > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393) > at > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365) > at java.lang.Thread.run(Thread.java:662) > .... .... > .... .... > 2014-02-11 17:37:43,141 [myid:2] - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception > EndOfStreamException: Unable to read additional data from client sessionid > 0x2441fe57a7e000e, likely client has closed socket > at > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) > at > > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) > at java.lang.Thread.run(Thread.java:662) > .... .... > > > *The 3rd node of the good cluster has only one kind of warnings as > follows:* > ... ... > 2014-02-13 17:02:36,005 [myid:3] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish > new session at /172.16.16.170:2345 > 2014-02-13 17:02:36,038 [myid:3] - INFO > [CommitProcessor:3:ZooKeeperServer@595] - Established session > 0x3440a225c00002d with negotiated timeout 40000 for client / > 172.16.16.170:2345 > 2014-02-14 04:20:19,622 [myid:3] - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception > EndOfStreamException: Unable to read additional data from client sessionid > 0x3440a225c000026, likely client has closed socket > at > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220) > at > > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) > at java.lang.Thread.run(Thread.java:662) > 2014-02-14 04:20:19,623 [myid:3] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for > client /172.16.66.134:55513 which had sessionid 0x3440a225c000026 > 2014-02-14 10:04:02,661 [myid:3] - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of > session 0x3440a225c00002d due to java.io.IOException: Connection reset by > peer > 2014-02-14 10:04:02,662 [myid:3] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for > client /172.16.16.170:2345 which had sessionid 0x3440a225c00002d > 2014-02-14 16:52:16,837 [myid:3] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket > connection > from /127.0.0.1:42057 > 2014-02-14 16:52:16,874 [myid:3] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@821] - Processing stat command from / > 127.0.0.1:42057 > 2014-02-14 16:52:16,896 [myid:3] - INFO > [Thread-3:NIOServerCnxn$StatCommand@655] - Stat command output > 2014-02-14 16:52:16,897 [myid:3] - INFO [Thread-3:NIOServerCnxn@1001] - > Closed socket connection for client /127.0.0.1:42057 (no session > established for client) > ... ... > > > *The 3rd node of the bad cluster has no warnings at all.* > > I checked the status for each of the servers, for both clusters > respectively. Executing the command "bin/zkServer.sh status" reported > "Error > contacting service. It is probably not running." on each of the node of the > bad cluster, whereas it reported "mode: follower/leader" on each node of > the good cluster. > > > I executed the command "echo stat | nc 127.0.0.1 2181" on each node of the > two clusters. All nodes printed the information similar to the following: > Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT > Clients: > /172.16.38.145:35045[1](queued=0,recved=19750,sent=19750) > /127.0.0.1:54713[0](queued=0,recved=1,sent=0) > /172.16.38.144:35483[1](queued=0,recved=19788,sent=19797) > /172.16.38.144:47445[1](queued=0,recved=1861,sent=1861) > Latency min/avg/max: 0/0/68 > Received: 42751 > Sent: 42882 > Connections: 4 > Outstanding: 0 > Zxid: 0x20000013d > Mode: follower(/leader) > Node count: 27 > > > > Besides, I looked for lines containing "TOOK" in the logs, and found that > the log file on each node of the two clusters has only one line containing > "TOOK". For all the nodes, there are other logging info following the line > containing "TOOK", and I'm not sure how to confirm whether the leader > election was successful. > 2014-02-07 10:17:08,311 [myid:2] - INFO > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 > datadir /storage/disk1/zookeeper/data.dir/version-2 snapdir > /storage/disk1/zookeeper/data.dir/version-2 > 2014-02-07 10:17:08,312 [myid:2] - INFO > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@345] - LEADING - LEADER > ELECTION TOOK - 610 > 2014-02-07 10:17:08,320 [myid:2] - INFO > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot > /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00053d5d > 2014-02-07 10:17:08,749 [myid:2] - INFO > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - > Snapshotting: 0x1f00063340 to > /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00063340 > 2014-02-07 10:17:09,222 [myid:2] - INFO > [LearnerHandler-/172.16.66.129:39195:LearnerHandler@263] - Follower sid: > 1 > : info : > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@5ddc0e7a > 2014-02-07 10:17:09,347 [myid:2] - INFO > [LearnerHandler-/172.16.66.129:39195:LearnerHandler@318] - Synchronizing > with Follower sid: 1 maxCommittedLog=0x1f00063340 > minCommittedLog=0x1f0006314c peerLastZxid=0x1f00063340 > 2014-02-07 10:17:09,348 [myid:2] - INFO > [LearnerHandler-/172.16.66.129:39195:LearnerHandler@395] - Sending DIFF > 2014-02-07 10:17:09,396 [myid:2] - WARN > [LearnerHandler-/172.16.66.129:39195:Leader@574] - Commiting zxid > 0x2000000000 from /172.16.66.130:2888 not first! > 2014-02-07 10:17:09,397 [myid:2] - WARN > [LearnerHandler-/172.16.66.129:39195:Leader@576] - First is 0x0 > > > Hope the info above can help you find the reason. Thanks! > > > Tao > > > 2014-02-13 18:05 GMT+08:00 German Blanco <[email protected]>: > > > I can´t explain it with this information. > > It could be e.g. that in the first cluster, you have a working ensemble > of > > two of the nodes and the "bin/zkServer.sh status" is trying to connect to > > the one that doesn´t work. > > I would start by checking the status of each of the servers (not just > one). > > Try out the 4 letter words commands: > > https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands > > They give a lot of information. > > Also, look in the server logs for the last successful leader election > (try > > grep for "TOOK" to do this) and the last election started (try grep for > > "New election"). If the last thing in the log is a "TOOK" then things > > should be fine, if it is a "New election" then there might be a problem. > > Then please post what you find out. > > > > Regards, > > > > German. > > > > > > On Thu, Feb 13, 2014 at 7:18 AM, Tao Xiao <[email protected]> > > wrote: > > > > > Hi German, > > > > > > I installed ZooKeeper in another cluster and it works well. I mean I > can > > > query the status through "bin/zkServer.sh status", how to explain that? > > > > > > > > > 2014-02-11 18:12 GMT+08:00 German Blanco < > [email protected] > > >: > > > > > > > Hello, > > > > > > > > It doesn't matter. > > > > The reason is that each zookeeper servers upon start-up try to > > establish > > > > connections with all other servers in its configuration. > > > > However, only one connection for pair should be there, so half of the > > > > connections are dropped. > > > > Right after connection servers send their ids. If the id of the peer > > > > initiating the connection is higher than the peer receiving the > > > connection, > > > > then everything proceeds, otherwise connection is rejected. > > > > Why do peers with a lower id try to open connections with the peers > > with > > > a > > > > higher id then? > > > > Because in that way, they trigger a connection attempt in the other > > > > direction. That is, say servers 2 and 3 have a working ensemble, they > > are > > > > not going to attempt to connect to server 1 unless something happens. > > > > Server 1 wakes up, it attempts to connect to servers 2 and 3 and > these > > > two > > > > connection attempts fail, but each triggers a connection attempt from > > the > > > > respective peer that succeeds. > > > > This was a bit more than you asked for, but anyway I hope it helps > :-). > > > If > > > > the answer doesn't work for you, please let me know. > > > > > > > > Best regards, > > > > > > > > German. > > > > > > > > > > > > On Tue, Feb 11, 2014 at 10:53 AM, Tao Xiao <[email protected] > > > > > > wrote: > > > > > > > > > I installed zookeeper 3.4.5 in a 3-node cluster and started it. I > > think > > > > > zookeeper works well because the HBase cluster, which relies on > > > > zookeeper, > > > > > indeed works well. But when I tried to query zookeeper's status, it > > > > > reported: > > > > > > > > > > [root@imon-1 zookeeper-3.4.5]# bin/zkServer.sh status > > > > > JMX enabled by default > > > > > Using config: /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg > > > > > Error contacting service. It is probably not running. > > > > > > > > > > > > > > > I checked the log and found it reported the following: > > > > > > > > > > 2014-02-11 15:42:15,623 [myid:] - INFO [main:QuorumPeerConfig@101 > ] > > - > > > > > Reading configuration from: > > > > > /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg > > > > > 2014-02-11 15:42:15,629 [myid:] - INFO [main:QuorumPeerConfig@334 > ] > > - > > > > > Defaulting to majority quorums > > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO > > [main:DatadirCleanupManager@78 > > > ] > > > > - > > > > > autopurge.snapRetainCount set to 3 > > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO > > [main:DatadirCleanupManager@79 > > > ] > > > > - > > > > > autopurge.purgeInterval set to 0 > > > > > 2014-02-11 15:42:15,653 [myid:1] - INFO > > > [main:DatadirCleanupManager@101 > > > > ] > > > > > - > > > > > Purge task is not scheduled. > > > > > 2014-02-11 15:42:15,683 [myid:1] - INFO [main:QuorumPeerMain@127] > - > > > > > Starting quorum peer > > > > > 2014-02-11 15:42:15,726 [myid:1] - INFO > > [main:NIOServerCnxnFactory@94 > > > ] > > > > - > > > > > binding to port 0.0.0.0/0.0.0.0:2181 > > > > > 2014-02-11 15:42:15,759 [myid:1] - INFO [main:QuorumPeer@913] - > > > > tickTime > > > > > set to 2000 > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO [main:QuorumPeer@933] - > > > > > minSessionTimeout set to -1 > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO [main:QuorumPeer@944] - > > > > > maxSessionTimeout set to -1 > > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO [main:QuorumPeer@959] - > > > > initLimit > > > > > set to 10 > > > > > 2014-02-11 15:42:15,874 [myid:1] - INFO [main:FileSnap@83] - > > Reading > > > > > snapshot /var/data/zk/dataDir/version-2/snapshot.100072a68 > > > > > 2014-02-11 15:42:18,211 [myid:1] - INFO > > > > > [Thread-1:QuorumCnxManager$Listener@486] - My election bind port: > > > > > 0.0.0.0/0.0.0.0:3888 > > > > > 2014-02-11 15:42:18,226 [myid:1] - INFO [QuorumPeer[myid=1]/ > > > > 0.0.0.0:2181 > > > > > :QuorumPeer@670] - LOOKING > > > > > 2014-02-11 15:42:18,232 [myid:1] - INFO [QuorumPeer[myid=1]/ > > > > 0.0.0.0:2181 > > > > > :FastLeaderElection@740] - New election. My id = 1, proposed > > > > > zxid=0x1000735ac > > > > > 2014-02-11 15:42:18,234 [myid:1] - INFO > > > > > [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1 > > > > > (n.leader), 0x1000735ac (n.zxid), 0x1 (n.round), LOOKING > (n.state), 1 > > > > > (n.sid), 0x1 (n.peerEPoch), LOOKING (my state) > > > > > 2014-02-11 15:42:18,256 [myid:1] - WARN > > > > > [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel > > to > > > 2 > > > > at > > > > > election address imon-2/172.16.38.144:3888 > > > > > java.net.ConnectException: Connection refused > > > > > at java.net.PlainSocketImpl.socketConnect(Native Method) > > > > > at > > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) > > > > > at > > > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) > > > > > at > java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) > > > > > at > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) > > > > > at java.net.Socket.connect(Socket.java:529) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365) > > > > > at java.lang.Thread.run(Thread.java:662) > > > > > 2014-02-11 15:42:18,268 [myid:1] - WARN > > > > > [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel > > to > > > 3 > > > > at > > > > > election address imon-3/172.16.38.145:3888 > > > > > java.net.ConnectException: Connection refused > > > > > at java.net.PlainSocketImpl.socketConnect(Native Method) > > > > > at > > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) > > > > > at > > > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) > > > > > at > java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) > > > > > at > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) > > > > > at java.net.Socket.connect(Socket.java:529) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393) > > > > > at > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365) > > > > > at java.lang.Thread.run(Thread.java:662) > > > > > 2014-02-11 15:42:18,436 [myid:1] - WARN [QuorumPeer[myid=1]/ > > > > 0.0.0.0:2181 > > > > > :QuorumCnxManager@368] - Cannot open channel to 2 at election > > address > > > > > imon-2/172.16.38.144:3888 > > > > > java.net.ConnectException: Connection refused > > > > > > > > > > ... ... > > > > > > > > > > > > > > > Do these warnings matter? What is the reason. > > > > > > > > > > > > > > >
