We see this line occur frequently, and the timeout gradually increases to 60000. All of our servers that appear to be behaving normally are nonetheless experiencing the CPU issue I mentioned earlier (https://issues.apache.org/jira/browse/ZOOKEEPER-427). Perhaps that is causing the timeout in responding?

Also, to answer your other questions, Patrick: we aren't really storing a large amount of data, and network latency appears fine.

Thanks for the help,

Nick
-----Original Message-----
From: "Nick Bailey" <nicholas.bai...@rackspace.com>
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server

12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. That is easily modifiable, and perhaps we should look into changing it.

The logs seem to indicate that the servers claiming no server is running are continually attempting to elect a leader. A sample is provided below. The initial exception is something we see regularly in our logs, and the DEBUG and INFO lines that follow simply repeat throughout the log.
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN - Exception causing close of session 0x0 due to java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
java.io.IOException: Read error
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
        at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.

-----Original Message-----
From: "Patrick Hunt" <ph...@apache.org>
Sent: Tuesday, January 12, 2010 5:40pm
To: zookeeper-user@hadoop.apache.org, nicholas.bai...@rackspace.com
Subject: Re: Killing a zookeeper server
12 servers? That's a lot; if you don't mind my asking, why so many? Typically we recommend 5 - that way you can have one down for maintenance and still sustain a failure without bringing down the cluster.

The "electing a leader" is probably the restarted machine attempting to re-join the ensemble (it should join as a follower if you already have an elected leader, given that its zxid is behind the existing leader's). Hard to tell, though, without the logs.

You might also be seeing the initLimit exceeded. Is the data you are storing in ZK large? Or perhaps network connectivity is slow? See http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions - again, the logs would give some insight on this.

Patrick

Nick Bailey wrote:
We are running ZooKeeper 3.1.0. Recently we noticed the CPU usage on our machines becoming increasingly high, and we believe the cause is https://issues.apache.org/jira/browse/ZOOKEEPER-427. However, our solution when we noticed the problem was to kill the ZooKeeper process and restart it.
After doing that, though, it looks like the newly restarted ZooKeeper server is continually attempting to elect a leader even though one already exists. The process responds with 'imok' when asked 'ruok', but the stat command returns 'ZooKeeperServer not running'.
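(As an aside, both of those checks are ZooKeeper four-letter commands sent over the client port. A minimal sketch of issuing them from Python - the address is the one from the log above, and the helper name is just illustrative:

    import socket

    def four_letter(host, port, cmd):
        # ZooKeeper replies to a four-letter word and then closes the socket.
        with socket.create_connection((host, port), timeout=5) as s:
            s.sendall(cmd.encode("ascii"))
            reply = b""
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                reply += chunk
        return reply.decode(errors="replace")

    print(four_letter("172.20.36.9", 2181, "ruok"))  # healthy server: 'imok'
    print(four_letter("172.20.36.9", 2181, "stat"))  # may print 'ZooKeeperServer not running'
)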
I believe that killing the current leader should trigger all servers to do an election and solve the problem, but I'm not sure. Should that be the course of action in this situation?
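(One way to confirm which server is currently the leader before killing anything is to look for the 'Mode:' line in each server's 'stat' reply; a rough sketch, where the hostnames are placeholders rather than our real ensemble:

    import socket

    HOSTS = ["zk01", "zk02", "zk03"]  # hypothetical names; substitute the real 12 servers

    def stat(host, port=2181):
        # Send the 'stat' four-letter word and collect the full reply.
        with socket.create_connection((host, port), timeout=5) as s:
            s.sendall(b"stat")
            reply = b""
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                reply += chunk
        return reply.decode(errors="replace")

    for host in HOSTS:
        lines = stat(host).splitlines()
        # A server stuck in election has no 'Mode:' line; it answers
        # 'ZooKeeperServer not running' instead.
        mode = next((l for l in lines if l.startswith("Mode:")), "no Mode line (not serving)")
        print(host, "->", mode)
)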
Also, we have 12 servers, but 5 are currently not running according to stat. So I guess this isn't a problem unless we lose another one. We have plans to upgrade ZooKeeper to solve the CPU issue but haven't been able to do that yet.
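(For what it's worth, a majority quorum needs floor(n/2) + 1 voters, so a 12-server ensemble needs 7 up: with 5 down we are at exactly quorum, and losing one more stops the cluster. A quick check:

    # Majority quorum arithmetic for an ensemble of n voting servers.
    def quorum_size(n):
        return n // 2 + 1

    n = 12
    print(quorum_size(n))      # 7 servers must be up
    print(n - quorum_size(n))  # at most 5 failures tolerated
)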
Any help appreciated,

Nick Bailey