Hi Adam,

That seems fair to file as an improvement. Running 'stat' did return the right status, correct? That is, it reported that the servers weren't able to elect a leader?

mahadev
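On Adam's point, the two commands really do answer different questions, and the difference is easy to see by probing them directly. Below is a rough, illustrative Java sketch (the class name, host argument, and the default client port 2181 are assumptions, not anything from this thread) that sends both four-letter commands to the same server; on a peer that is stuck in election, 'ruok' still comes back 'imok' while 'stat' reports "ZooKeeperServer not running":

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

// Illustrative probe for ZooKeeper's four-letter commands. 'ruok' only says
// the JVM is alive; 'stat' shows whether this peer is actually serving
// (it answers "ZooKeeperServer not running" while still in leader election).
public class ZkProbe {

    static String fourLetter(String host, int port, String cmd) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(cmd.getBytes("US-ASCII"));
            s.getOutputStream().flush();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(s.getInputStream()));
            StringBuilder reply = new StringBuilder();
            String line;
            // The server closes the connection after replying, so read to EOF.
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString().trim();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("ruok -> " + fourLetter(host, 2181, "ruok"));
        System.out.println("stat -> " + fourLetter(host, 2181, "stat"));
    }
}

Running that against each server in the ensemble is a quick way to separate "process is up" from "actually serving in the quorum".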
On 1/13/10 11:52 AM, "Adam Rosien" <a...@rosien.net> wrote:

> On a related note, it was initially confusing to me that the server
> returned 'imok' when it wasn't part of the quorum. I realize the
> internal checks are probably in separate areas of the code, but if
> others feel similarly I could file an improvement in JIRA.
>
> .. Adam
>
> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <ni...@mailtrust.com> wrote:
>> So the solution for us was to just nuke zookeeper and restart everywhere.
>> We will also be upgrading soon.
>>
>> To answer your question, yes, I believe all the servers were running
>> normally except for the fact that they were experiencing high CPU usage.
>> As we began to see some CPU alerts I started restarting some of the servers.
>>
>> It was then that we noticed that they were not actually running according
>> to 'stat'.
>>
>> I still have the log from one server at debug level and the rest at warn
>> level. If you would like to see any of these and analyze them, just let
>> me know.
>>
>> Thanks for the help,
>> Nick Bailey
>>
>> On Jan 12, 2010, at 8:20 PM, Patrick Hunt <ph...@apache.org> wrote:
>>
>>> Nick Bailey wrote:
>>>>
>>>> In my last email I failed to include a log line that may be relevant
>>>> as well:
>>>>
>>>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size: 0
>>>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification time out: 6400
>>>
>>> Yes, that is significant/interesting. I believe this means that there is
>>> some problem with the election process (i.e. the server re-joining the
>>> ensemble). We have a backoff on these attempts, which matches your
>>> description below. We have fixed some election issues in recent versions
>>> (we introduced fault injection testing prior to the 3.2.1 release, which
>>> found a few issues with election). I don't have them offhand - but I've
>>> asked Flavio to comment directly (he's in a different tz).
>>>
>>> Can you provide a bit more background: prior to this issue, this
>>> particular server was running fine? You restarted it and then started
>>> seeing the issue? (Rather than this being a new server, I mean.) What I'm
>>> getting at is that there shouldn't/couldn't be any networking/firewall
>>> type issue going on, right?
>>>
>>> Can you provide a full/more complete log? What I'd suggest is: shut down
>>> this one server, clear the log4j log file, then restart it. Let the
>>> problem reproduce, then gzip the log4j log file and attach it to your
>>> response. Ok?
>>>
>>> Patrick
>>>
>>>> We see this line occur frequently and the timeout gradually increases
>>>> to 60000. It appears that all of our servers that seem to be acting
>>>> normally are experiencing the CPU issue I mentioned earlier
>>>> (https://issues.apache.org/jira/browse/ZOOKEEPER-427). Perhaps that is
>>>> causing the timeout in responding?
>>>>
>>>> Also, to answer your other questions, Patrick: we aren't really storing
>>>> a large amount of data, and network latency appears fine.
>>>>
>>>> Thanks for the help,
>>>> Nick
>>>>
>>>> -----Original Message-----
>>>> From: "Nick Bailey" <nicholas.bai...@rackspace.com>
>>>> Sent: Tuesday, January 12, 2010 6:03pm
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: Re: Killing a zookeeper server
>>>>
>>>> 12 was just to keep uniformity on our servers. Our clients are
>>>> connecting from the same 12 servers. Easily modifiable, and perhaps we
>>>> should look into changing that.
>>>> The logs just seem to indicate that the servers that claim to have no
>>>> server running are continually attempting to elect a leader. A sample is
>>>> provided below. The initial exception is something we see regularly in
>>>> our logs, and the DEBUG and INFO lines following it simply repeat
>>>> throughout the log.
>>>>
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN - Exception causing close of session 0x0 due to java.io.IOException: Read error
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
>>>> java.io.IOException: Read error
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
>>>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
>>>> java.net.SocketException: Transport endpoint is not connected
>>>>     at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>>     at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>
>>>> -----Original Message-----
>>>> From: "Patrick Hunt" <ph...@apache.org>
>>>> Sent: Tuesday, January 12, 2010 5:40pm
>>>> To: zookeeper-user@hadoop.apache.org, nicholas.bai...@rackspace.com
>>>> Subject: Re: Killing a zookeeper server
>>>>
>>>> 12 servers? That's a lot - if you don't mind my asking, why so many?
>>>> Typically we recommend 5; that way you can have one down for maintenance
>>>> and still tolerate a failure without bringing down the cluster.
>>>>
>>>> The "electing a leader" is probably the restarted machine attempting to
>>>> re-join the ensemble (it should join as a follower if you already have a
>>>> leader elected, given that its zxid is behind the existing leader's).
>>>> Hard to tell, though, without the logs.
>>>>
>>>> You might also be exceeding initLimit - is the data you are storing in
>>>> ZK large? Or perhaps network connectivity is slow? See
>>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>>>> Again, the logs would give some insight on this.
>>>>
>>>> Patrick
>>>>
>>>> Nick Bailey wrote:
>>>>>
>>>>> We are running zookeeper 3.1.0.
>>>>>
>>>>> Recently we noticed the CPU usage on our machines becoming increasingly
>>>>> high, and we believe the cause is
>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>>>
>>>>> However, our solution when we noticed the problem was to kill the
>>>>> zookeeper process and restart it.
>>>>>
>>>>> After doing that, though, it looks like the newly restarted zookeeper
>>>>> server is continually attempting to elect a leader even though one
>>>>> already exists.
>>>>>
>>>>> The process responds with 'imok' when asked, but the stat command
>>>>> returns 'ZooKeeperServer not running'.
>>>>>
>>>>> I believe that killing the current leader should trigger all servers
>>>>> to do an election and solve the problem, but I'm not sure. Should that
>>>>> be the course of action in this situation?
>>>>>
>>>>> Also, we have 12 servers, but 5 are currently not running according to
>>>>> stat. So I guess this isn't a problem unless we lose another one.
>>>>> We have plans to upgrade zookeeper to solve the CPU issue but haven't
>>>>> been able to do that yet.
>>>>>
>>>>> Any help appreciated,
>>>>> Nick Bailey
>>
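For reference, the "cluster options" Patrick links to are the tickTime / initLimit / syncLimit settings in zoo.cfg. A purely illustrative sketch follows (host names and values are made up, not Nick's actual configuration):

# zoo.cfg - illustrative values only
tickTime=2000        # milliseconds per tick
initLimit=10         # ticks a follower may take to connect to and sync with the leader
syncLimit=5          # ticks a follower may fall behind before being dropped
dataDir=/var/zookeeper
clientPort=2181

# one entry per ensemble member: host:peerPort:electionPort
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

With these example values a restarted follower has initLimit * tickTime = 20 seconds to connect to the leader and sync; a large data set or a slow link can exhaust that window, which is why Patrick asks about data size and network latency. Raising initLimit is the usual adjustment in that case.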