Re: zookeeper works well but log reports "Cannot open channel to 2 at election address ... java.net.ConnectException: Connection refused"

German Blanco Fri, 14 Feb 2014 04:08:25 -0800

Hello Tao,

For the reason of the logs "Cannot open channel to ..." see my first reply.
I believe they are not relevant for your problem, so let's forget about
them for now.


You are saying in your mail that command "echo stat | nc 127.0.0.1 2181"
reports "Mode: follower(/leader)" also in the *bad cluster*?
That is very strange. If that is the case, then "bin/zkServer.sh status"
should also work ok. Please confirm.

If you found the line TOOK in the bad cluster, that means leader election
was successful.

Next thing that should be there (most likely a few lines below) is this
(only in the leader):
"Have quorum of supporters ..."

Do you see something like this?
"[main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181"
(grep "binding to port")

Could you also please check that you have a normal DNS resolution of
localhost in the bad cluster?

Regards,

German.


On Fri, Feb 14, 2014 at 10:55 AM, Tao Xiao <[email protected]> wrote:

> Hi German,
>
> Thanks for your reply.
>
> I checked logs on all the three nodes for the two clusters(the *good
> cluster* where  I can query the status through "bin/zkServer.sh status" and
> the* bad cluster* where executing the same command would report "Error
> contacting service. It is probably not running.") respectively.
>
>
> *The 1st and 2nd nods of both clusters have the following warnings in their
> logs:*
> .... ....
> 2014-02-11 15:42:15,421 [myid:2] - WARN
>  [WorkerSender[myid=2]:QuorumCnxManager@368] - Cannot open channel to 3 at
> election address imon-3/172.16.38.145:3888
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:529)
>     at
>
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
>     at
>
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
>     at
>
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
>     at
>
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
>     at java.lang.Thread.run(Thread.java:662)
> .... ....
> .... ....
> 2014-02-11 17:37:43,141 [myid:2] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x2441fe57a7e000e, likely client has closed socket
>     at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>     at
>
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>     at java.lang.Thread.run(Thread.java:662)
> .... ....
>
>
> *The 3rd node of the good cluster has only one kind of warnings as
> follows:*
> ... ...
> 2014-02-13 17:02:36,005 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client attempting to establish
> new session at /172.16.16.170:2345
> 2014-02-13 17:02:36,038 [myid:3] - INFO
>  [CommitProcessor:3:ZooKeeperServer@595] - Established session
> 0x3440a225c00002d with negotiated timeout 40000 for client /
> 172.16.16.170:2345
> 2014-02-14 04:20:19,622 [myid:3] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x3440a225c000026, likely client has closed socket
>         at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at
>
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:662)
> 2014-02-14 04:20:19,623 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /172.16.66.134:55513 which had sessionid 0x3440a225c000026
> 2014-02-14 10:04:02,661 [myid:3] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of
> session 0x3440a225c00002d due to java.io.IOException: Connection reset by
> peer
> 2014-02-14 10:04:02,662 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for
> client /172.16.16.170:2345 which had sessionid 0x3440a225c00002d
> 2014-02-14 16:52:16,837 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection
> from /127.0.0.1:42057
> 2014-02-14 16:52:16,874 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@821] - Processing stat command from /
> 127.0.0.1:42057
> 2014-02-14 16:52:16,896 [myid:3] - INFO
>  [Thread-3:NIOServerCnxn$StatCommand@655] - Stat command output
> 2014-02-14 16:52:16,897 [myid:3] - INFO  [Thread-3:NIOServerCnxn@1001] -
> Closed socket connection for client /127.0.0.1:42057 (no session
> established for client)
> ... ...
>
>
> *The 3rd node of the bad cluster has no warnings at all.*
>
> I checked the status for each of the servers, for both clusters
> respectively. Executing the command "bin/zkServer.sh status" reported
> "Error
> contacting service. It is probably not running." on each of the node of the
> bad cluster, whereas it reported "mode: follower/leader" on each node of
> the good cluster.
>
>
> I executed the command "echo stat | nc 127.0.0.1 2181" on each node of the
> two clusters. All nodes printed the information similar to the following:
> Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
> Clients:
>  /172.16.38.145:35045[1](queued=0,recved=19750,sent=19750)
>  /127.0.0.1:54713[0](queued=0,recved=1,sent=0)
>  /172.16.38.144:35483[1](queued=0,recved=19788,sent=19797)
>  /172.16.38.144:47445[1](queued=0,recved=1861,sent=1861)
> Latency min/avg/max: 0/0/68
> Received: 42751
> Sent: 42882
> Connections: 4
> Outstanding: 0
> Zxid: 0x20000013d
> Mode: follower(/leader)
> Node count: 27
>
>
>
> Besides, I looked for lines containing "TOOK" in the logs, and found that
> the log file on each node of the two clusters has only one line containing
> "TOOK". For all the nodes, there are other logging info following the line
> containing "TOOK", and I'm not sure how to confirm whether the leader
> election was successful.
> 2014-02-07 10:17:08,311 [myid:2] - INFO
>  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created
> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> datadir /storage/disk1/zookeeper/data.dir/version-2 snapdir
> /storage/disk1/zookeeper/data.dir/version-2
> 2014-02-07 10:17:08,312 [myid:2] - INFO
>  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Leader@345] - LEADING - LEADER
> ELECTION TOOK - 610
> 2014-02-07 10:17:08,320 [myid:2] - INFO
>  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
> /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00053d5d
> 2014-02-07 10:17:08,749 [myid:2] - INFO
>  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
> Snapshotting: 0x1f00063340 to
> /storage/disk1/zookeeper/data.dir/version-2/snapshot.1f00063340
> 2014-02-07 10:17:09,222 [myid:2] - INFO
>  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@263] - Follower sid:
> 1
> : info :
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@5ddc0e7a
> 2014-02-07 10:17:09,347 [myid:2] - INFO
>  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@318] - Synchronizing
> with Follower sid: 1 maxCommittedLog=0x1f00063340
> minCommittedLog=0x1f0006314c peerLastZxid=0x1f00063340
> 2014-02-07 10:17:09,348 [myid:2] - INFO
>  [LearnerHandler-/172.16.66.129:39195:LearnerHandler@395] - Sending DIFF
> 2014-02-07 10:17:09,396 [myid:2] - WARN
>  [LearnerHandler-/172.16.66.129:39195:Leader@574] - Commiting zxid
> 0x2000000000 from /172.16.66.130:2888 not first!
> 2014-02-07 10:17:09,397 [myid:2] - WARN
>  [LearnerHandler-/172.16.66.129:39195:Leader@576] - First is 0x0
>
>
> Hope the info above can help you find the reason. Thanks!
>
>
> Tao
>
>
> 2014-02-13 18:05 GMT+08:00 German Blanco <[email protected]>:
>
> > I can´t explain it with this information.
> > It could be e.g. that in the first cluster, you have a working ensemble
> of
> > two of the nodes and the "bin/zkServer.sh status" is trying to connect to
> > the one that doesn´t work.
> > I would start by checking the status of each of the servers (not just
> one).
> > Try out the 4 letter words commands:
> > https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands
> > They give a lot of information.
> > Also, look in the server logs for the last successful leader election
> (try
> > grep for "TOOK" to do this) and the last election started (try grep for
> > "New election"). If the last thing in the log is a "TOOK" then things
> > should be fine, if it is a "New election" then there might be a problem.
> > Then please post what you find out.
> >
> > Regards,
> >
> > German.
> >
> >
> > On Thu, Feb 13, 2014 at 7:18 AM, Tao Xiao <[email protected]>
> > wrote:
> >
> > > Hi German,
> > >
> > > I installed ZooKeeper in another cluster and it works well. I mean I
> can
> > > query the status through "bin/zkServer.sh status", how to explain that?
> > >
> > >
> > > 2014-02-11 18:12 GMT+08:00 German Blanco <
> [email protected]
> > >:
> > >
> > > > Hello,
> > > >
> > > > It doesn't matter.
> > > > The reason is that each zookeeper servers upon start-up try to
> > establish
> > > > connections with all other servers in its configuration.
> > > > However, only one connection for pair should be there, so half of the
> > > > connections are dropped.
> > > > Right after connection servers send their ids. If the id of the peer
> > > > initiating the connection is higher than the peer receiving the
> > > connection,
> > > > then everything proceeds, otherwise connection is rejected.
> > > > Why do peers with a lower id try to open connections with the peers
> > with
> > > a
> > > > higher id then?
> > > > Because in that way, they trigger a connection attempt in the other
> > > > direction. That is, say servers 2 and 3 have a working ensemble, they
> > are
> > > > not going to attempt to connect to server 1 unless something happens.
> > > > Server 1 wakes up, it attempts to connect to servers 2 and 3 and
> these
> > > two
> > > > connection attempts fail, but each triggers a connection attempt from
> > the
> > > > respective peer that succeeds.
> > > > This was a bit more than you asked for, but anyway I hope it helps
> :-).
> > > If
> > > > the answer doesn't work for you, please let me know.
> > > >
> > > > Best regards,
> > > >
> > > > German.
> > > >
> > > >
> > > > On Tue, Feb 11, 2014 at 10:53 AM, Tao Xiao <[email protected]
> >
> > > > wrote:
> > > >
> > > > > I installed zookeeper 3.4.5 in a 3-node cluster and started it. I
> > think
> > > > > zookeeper works well because the HBase cluster, which relies on
> > > > zookeeper,
> > > > > indeed works well. But when I tried to query zookeeper's status, it
> > > > > reported:
> > > > >
> > > > > [root@imon-1 zookeeper-3.4.5]# bin/zkServer.sh status
> > > > > JMX enabled by default
> > > > > Using config: /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > > Error contacting service. It is probably not running.
> > > > >
> > > > >
> > > > > I checked the log and found it reported the following:
> > > > >
> > > > > 2014-02-11 15:42:15,623 [myid:] - INFO  [main:QuorumPeerConfig@101
> ]
> > -
> > > > > Reading configuration from:
> > > > > /usr/local/apache/zookeeper-3.4.5/bin/../conf/zoo.cfg
> > > > > 2014-02-11 15:42:15,629 [myid:] - INFO  [main:QuorumPeerConfig@334
> ]
> > -
> > > > > Defaulting to majority quorums
> > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
> >  [main:DatadirCleanupManager@78
> > > ]
> > > > -
> > > > > autopurge.snapRetainCount set to 3
> > > > > 2014-02-11 15:42:15,652 [myid:1] - INFO
> >  [main:DatadirCleanupManager@79
> > > ]
> > > > -
> > > > > autopurge.purgeInterval set to 0
> > > > > 2014-02-11 15:42:15,653 [myid:1] - INFO
> > >  [main:DatadirCleanupManager@101
> > > > ]
> > > > > -
> > > > > Purge task is not scheduled.
> > > > > 2014-02-11 15:42:15,683 [myid:1] - INFO  [main:QuorumPeerMain@127]
> -
> > > > > Starting quorum peer
> > > > > 2014-02-11 15:42:15,726 [myid:1] - INFO
> >  [main:NIOServerCnxnFactory@94
> > > ]
> > > > -
> > > > > binding to port 0.0.0.0/0.0.0.0:2181
> > > > > 2014-02-11 15:42:15,759 [myid:1] - INFO  [main:QuorumPeer@913] -
> > > > tickTime
> > > > > set to 2000
> > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@933] -
> > > > > minSessionTimeout set to -1
> > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@944] -
> > > > > maxSessionTimeout set to -1
> > > > > 2014-02-11 15:42:15,760 [myid:1] - INFO  [main:QuorumPeer@959] -
> > > > initLimit
> > > > > set to 10
> > > > > 2014-02-11 15:42:15,874 [myid:1] - INFO  [main:FileSnap@83] -
> > Reading
> > > > > snapshot /var/data/zk/dataDir/version-2/snapshot.100072a68
> > > > > 2014-02-11 15:42:18,211 [myid:1] - INFO
> > > > >  [Thread-1:QuorumCnxManager$Listener@486] - My election bind port:
> > > > > 0.0.0.0/0.0.0.0:3888
> > > > > 2014-02-11 15:42:18,226 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > > 0.0.0.0:2181
> > > > > :QuorumPeer@670] - LOOKING
> > > > > 2014-02-11 15:42:18,232 [myid:1] - INFO  [QuorumPeer[myid=1]/
> > > > 0.0.0.0:2181
> > > > > :FastLeaderElection@740] - New election. My id =  1, proposed
> > > > > zxid=0x1000735ac
> > > > > 2014-02-11 15:42:18,234 [myid:1] - INFO
> > > > >  [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1
> > > > > (n.leader), 0x1000735ac (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1
> > > > > (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
> > > > > 2014-02-11 15:42:18,256 [myid:1] - WARN
> > > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel
> > to
> > > 2
> > > > at
> > > > > election address imon-2/172.16.38.144:3888
> > > > > java.net.ConnectException: Connection refused
> > > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > > >         at
> > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > > >         at
> > > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > > >         at
> java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > > >         at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > > >         at java.net.Socket.connect(Socket.java:529)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > > >         at java.lang.Thread.run(Thread.java:662)
> > > > > 2014-02-11 15:42:18,268 [myid:1] - WARN
> > > > >  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel
> > to
> > > 3
> > > > at
> > > > > election address imon-3/172.16.38.145:3888
> > > > > java.net.ConnectException: Connection refused
> > > > >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> > > > >         at
> > java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> > > > >         at
> > > > > java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> > > > >         at
> java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> > > > >         at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> > > > >         at java.net.Socket.connect(Socket.java:529)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
> > > > >         at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
> > > > >         at java.lang.Thread.run(Thread.java:662)
> > > > > 2014-02-11 15:42:18,436 [myid:1] - WARN  [QuorumPeer[myid=1]/
> > > > 0.0.0.0:2181
> > > > > :QuorumCnxManager@368] - Cannot open channel to 2 at election
> > address
> > > > > imon-2/172.16.38.144:3888
> > > > > java.net.ConnectException: Connection refused
> > > > >
> > > > > ... ...
> > > > >
> > > > >
> > > > > Do these warnings matter?  What is the reason.
> > > > >
> > > >
> > >
> >
>

Re: zookeeper works well but log reports "Cannot open channel to 2 at election address ... java.net.ConnectException: Connection refused"

Reply via email to