On a related note, it was initially confusing to me that the server
returned 'imok' when it wasn't part of the quorum. I realize the
internal checks are probably in separate areas of the code, but if
others feel similarly I could file an improvement in JIRA.
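
To illustrate what I mean (host/port are just an example), probing the same
server with the four-letter words looks roughly like this:

  $ echo ruok | nc localhost 2181
  imok
  $ echo stat | nc localhost 2181
  ZooKeeperServer not running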

.. Adam

On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <ni...@mailtrust.com> wrote:
> So the solution for us was to just nuke zookeeper and restart everywhere.
> We will also be upgrading soon.
>
> To answer your question, yes I believe all the servers were running normally
> except for the fact that they were experiencing high CPU usage.  As we began
> to see some CPU alerts I started restarting some of the servers.
>
> It was then that we noticed that they were not actually running according to
> 'stat'.
>
> I still have the log from one server with a debug level and the rest with a
> warn level. If you would like to see any of these and analyze them just let
> me know.
>
> Thanks for the help,
> Nick Bailey
>
> On Jan 12, 2010, at 8:20 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> Nick Bailey wrote:
>>>
>>> In my last email I failed to include a log line that may be relevant as
>>> well:
>>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager)
>>> DEBUG - Queue size: 0
>>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification time out: 6400
>>
>> Yes, that is significant/interesting. I believe this means that there is
>> some problem with the election process (i.e. the server re-joining the
>> ensemble). We have a backoff on these attempts, which matches your
>> description below. We have fixed some election issues in recent versions (we
>> introduced fault injection testing prior to the 3.2.1 release, which found a
>> few issues with election). I don't have them offhand, but I've asked
>> Flavio to comment directly (he's in a different time zone).
>>
>> Can you provide a bit more background: prior to this issue, this
>> particular server was running fine? You restarted it and then started seeing
>> the issue (rather than this being a new server, I mean)? What I'm getting at
>> is that there shouldn't be any networking/firewall type issue going
>> on, right?
>>
>> Can you provide a full/more log? What I'd suggest is shut down this one
>> server, clear the log4j log file, then restart it. Let the problem
>> reproduce, then gzip the log4j log file and attach to your response. Ok?
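>>
>> For example, something along these lines (the zkServer.sh path and log file
>> location are just placeholders; yours will depend on your install and log4j
>> config):
>>
>>   # stop the affected server
>>   bin/zkServer.sh stop
>>   # clear/truncate the log4j log file
>>   : > /var/log/zookeeper/zookeeper.log
>>   # restart and let the problem reproduce
>>   bin/zkServer.sh start
>>   # then compress the log for attaching
>>   gzip -c /var/log/zookeeper/zookeeper.log > zk-election-debug.log.gz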
>>
>> Patrick
>>
>>> We see this line occur frequently, and the timeout gradually
>>> increases to 60000. It appears that all of our servers that seem to be acting
>>> normally are experiencing the CPU issue I mentioned earlier
>>> ('https://issues.apache.org/jira/browse/ZOOKEEPER-427'). Perhaps that is
>>> causing the timeout in responding?
>>> Also, to answer your other questions, Patrick: we aren't storing a large
>>> amount of data, and network latency appears fine.
>>> Thanks for the help,
>>> Nick
>>> -----Original Message-----
>>> From: "Nick Bailey" <nicholas.bai...@rackspace.com>
>>> Sent: Tuesday, January 12, 2010 6:03pm
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Killing a zookeeper server
>>> 12 was just to keep uniformity on our servers. Our clients are connecting
>>> from the same 12 servers. That's easily modifiable, and perhaps we should
>>> look into changing it.
>>> The logs just seem to indicate that the servers that claim to have no
>>> server running are continually attempting to elect a leader.  A sample is
>>> provided below.  The initial exception is something we see regularly in our
>>> logs and the debug and info lines following are simply repeating throughout
>>> the log.
>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN
>>>  - Exception causing close of session 0x0 due to java.io.IOException: Read
>>> error
>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
>>> - IOException stack trace
>>> java.io.IOException: Read error
>>>       at
>>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>>>       at
>>> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO
>>>  - closing session:0x0 NIOServerCnxn:
>>> java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181
>>> remote=/172.20.36.9:50367]
>>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
>>> - ignoring exception during input shutdown
>>> java.net.SocketException: Transport endpoint is not connected
>>>       at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>>>       at
>>> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>       at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>       at
>>> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>>>       at
>>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>>>       at
>>> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread]
>>> (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread]
>>> (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
>>> INFO  - Adding vote
>>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread]
>>> (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread]
>>> (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>> -----Original Message-----
>>> From: "Patrick Hunt" <ph...@apache.org>
>>> Sent: Tuesday, January 12, 2010 5:40pm
>>> To: zookeeper-user@hadoop.apache.org, nicholas.bai...@rackspace.com
>>> Subject: Re: Killing a zookeeper server
>>> 12 servers? That's a lot; if you don't mind my asking, why so many?
>>> Typically we recommend 5 - that way you can have one down for maintenance
>>> and still have a failure that doesn't bring down the cluster.
>>> The "electing a leader" is probably the restarted machine attempting to
>>> re-join the ensemble (it should join as a follower if you have a leader
>>> already elected, given that it's xid is behind the existing leader.) Hard to
>>> tell though without the logs.
>>> You might also be seeing the initLimit exceeded, is the data you are
>>> storing in ZK large? Or perhaps network connectivity is slow?
>>>
>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>>> again the logs would give some insight on this.
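>>>
>>> For reference, those limits live in zoo.cfg; the values and hostnames below
>>> are purely illustrative, not a recommendation:
>>>
>>>   tickTime=2000
>>>   # followers have initLimit*tickTime ms to connect to and sync with the leader
>>>   initLimit=10
>>>   # followers may lag at most syncLimit*tickTime ms behind the leader
>>>   syncLimit=5
>>>   dataDir=/var/zookeeper
>>>   clientPort=2181
>>>   server.1=zk1.example.com:2888:3888
>>>   server.2=zk2.example.com:2888:3888
>>>   server.3=zk3.example.com:2888:3888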
>>> Patrick
>>> Nick Bailey wrote:
>>>>
>>>> We are running zookeeper 3.1.0
>>>>
>>>> Recently we noticed the cpu usage on our machines becoming
>>>> increasingly high and we believe the cause is
>>>>
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>>
>>>> However our solution when we noticed the problem was to kill the
>>>> zookeeper process and restart it.
>>>>
>>>> After doing that though it looks like the newly restarted zookeeper
>>>> server is continually attempting to elect a leader even though one
>>>> already exists.
>>>>
>>>> The process responds with 'imok' when asked, but the stat command
>>>> returns 'ZooKeeperServer not running'.
>>>>
>>>> I believe that killing the current leader should trigger all servers
>>>> to do an election and solve the problem, but I'm not sure. Should
>>>> that be the course of action in this situation?
>>>>
>>>> Also we have 12 servers, but 5 are currently not running according to
>>>> stat.  So I guess this isn't a problem unless we lose another one.
>>>> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>>>> been able to do that yet.
>>>>
>>>> Any help appreciated, Nick Bailey
>>>>
>
