Hi Adam,

That seems fair to file as an improvement. Running 'stat' did return the right status, correct? That is, it reported that the servers weren't able to elect a leader?

mahadev
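On Adam's point, the two commands really do answer different questions, and the difference is easy to see by probing them directly. Below is a rough, illustrative Java sketch (the class name, host argument, and the default client port 2181 are assumptions, not anything from this thread) that sends both four-letter commands to the same server; on a peer that is stuck in election, 'ruok' still comes back 'imok' while 'stat' reports "ZooKeeperServer not running":

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

// Illustrative probe for ZooKeeper's four-letter commands. 'ruok' only says
// the JVM is alive; 'stat' shows whether this peer is actually serving
// (it answers "ZooKeeperServer not running" while still in leader election).
public class ZkProbe {

    static String fourLetter(String host, int port, String cmd) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(cmd.getBytes("US-ASCII"));
            s.getOutputStream().flush();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(s.getInputStream()));
            StringBuilder reply = new StringBuilder();
            String line;
            // The server closes the connection after replying, so read to EOF.
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString().trim();
        }
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("ruok -> " + fourLetter(host, 2181, "ruok"));
        System.out.println("stat -> " + fourLetter(host, 2181, "stat"));
    }
}

Running that against each server in the ensemble is a quick way to separate "process is up" from "actually serving in the quorum".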
On 1/13/10 11:52 AM, "Adam Rosien" <a...@rosien.net> wrote:

> On a related note, it was initially confusing to me that the server
> returned 'imok' when it wasn't part of the quorum. I realize the
> internal checks are probably in separate areas of the code, but if
> others feel similarly I could file an improvement in JIRA.
>
> .. Adam
>
> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <ni...@mailtrust.com> wrote:
>> So the solution for us was to just nuke zookeeper and restart everywhere.
>> We will also be upgrading soon.
>>
>> To answer your question, yes, I believe all the servers were running
>> normally except for the fact that they were experiencing high CPU usage.
>> As we began to see some CPU alerts I started restarting some of the servers.
>>
>> It was then that we noticed that they were not actually running according
>> to 'stat'.
>>
>> I still have the log from one server at debug level and the rest at warn
>> level. If you would like to see any of these and analyze them, just let
>> me know.
>>
>> Thanks for the help,
>> Nick Bailey
>>
>> On Jan 12, 2010, at 8:20 PM, Patrick Hunt <ph...@apache.org> wrote:
>>
>>> Nick Bailey wrote:
>>>>
>>>> In my last email I failed to include a log line that may be relevant
>>>> as well:
>>>>
>>>> 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size: 0
>>>> 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification time out: 6400
>>>
>>> Yes, that is significant/interesting. I believe this means that there is
>>> some problem with the election process (i.e. the server re-joining the
>>> ensemble). We have a backoff on these attempts, which matches your
>>> description below. We have fixed some election issues in recent versions
>>> (we introduced fault injection testing prior to the 3.2.1 release, which
>>> found a few issues with election). I don't have them offhand - but I've
>>> asked Flavio to comment directly (he's in a different tz).
>>>
>>> Can you provide a bit more background: prior to this issue, this
>>> particular server was running fine? You restarted it and then started
>>> seeing the issue? (Rather than this being a new server, I mean.) What I'm
>>> getting at is that there shouldn't/couldn't be any networking/firewall
>>> type issue going on, right?
>>>
>>> Can you provide a full/more complete log? What I'd suggest is: shut down
>>> this one server, clear the log4j log file, then restart it. Let the
>>> problem reproduce, then gzip the log4j log file and attach it to your
>>> response. Ok?
>>>
>>> Patrick
>>>
>>>> We see this line occur frequently and the timeout gradually increases
>>>> to 60000. It appears that all of our servers that seem to be acting
>>>> normally are experiencing the CPU issue I mentioned earlier
>>>> (https://issues.apache.org/jira/browse/ZOOKEEPER-427). Perhaps that is
>>>> causing the timeout in responding?
>>>>
>>>> Also, to answer your other questions, Patrick: we aren't really storing
>>>> a large amount of data, and network latency appears fine.
>>>>
>>>> Thanks for the help,
>>>> Nick
>>>>
>>>> -----Original Message-----
>>>> From: "Nick Bailey" <nicholas.bai...@rackspace.com>
>>>> Sent: Tuesday, January 12, 2010 6:03pm
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: Re: Killing a zookeeper server
>>>>
>>>> 12 was just to keep uniformity on our servers. Our clients are
>>>> connecting from the same 12 servers. Easily modifiable, and perhaps we
>>>> should look into changing that.
>>>> The logs just seem to indicate that the servers that claim to have no
>>>> server running are continually attempting to elect a leader. A sample is
>>>> provided below. The initial exception is something we see regularly in
>>>> our logs, and the DEBUG and INFO lines following it simply repeat
>>>> throughout the log.
>>>>
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN - Exception causing close of session 0x0 due to java.io.IOException: Read error
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
>>>> java.io.IOException: Read error
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>> 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
>>>> 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
>>>> java.net.SocketException: Transport endpoint is not connected
>>>>     at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>>     at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
>>>>     at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
>>>> 2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
>>>> 2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>> 2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
>>>>
>>>> -----Original Message-----
>>>> From: "Patrick Hunt" <ph...@apache.org>
>>>> Sent: Tuesday, January 12, 2010 5:40pm
>>>> To: zookeeper-user@hadoop.apache.org, nicholas.bai...@rackspace.com
>>>> Subject: Re: Killing a zookeeper server
>>>>
>>>> 12 servers? That's a lot - if you don't mind my asking, why so many?
>>>> Typically we recommend 5; that way you can have one down for maintenance
>>>> and still tolerate a failure without bringing down the cluster.
>>>>
>>>> The "electing a leader" is probably the restarted machine attempting to
>>>> re-join the ensemble (it should join as a follower if you already have a
>>>> leader elected, given that its zxid is behind the existing leader's).
>>>> Hard to tell, though, without the logs.
>>>>
>>>> You might also be exceeding initLimit - is the data you are storing in
>>>> ZK large? Or perhaps network connectivity is slow? See
>>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
>>>> Again, the logs would give some insight on this.
>>>>
>>>> Patrick
>>>>
>>>> Nick Bailey wrote:
>>>>>
>>>>> We are running zookeeper 3.1.0.
>>>>>
>>>>> Recently we noticed the CPU usage on our machines becoming increasingly
>>>>> high, and we believe the cause is
>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>>>>
>>>>> However, our solution when we noticed the problem was to kill the
>>>>> zookeeper process and restart it.
>>>>>
>>>>> After doing that, though, it looks like the newly restarted zookeeper
>>>>> server is continually attempting to elect a leader even though one
>>>>> already exists.
>>>>>
>>>>> The process responds with 'imok' when asked, but the stat command
>>>>> returns 'ZooKeeperServer not running'.
>>>>>
>>>>> I believe that killing the current leader should trigger all servers
>>>>> to do an election and solve the problem, but I'm not sure. Should that
>>>>> be the course of action in this situation?
>>>>>
>>>>> Also, we have 12 servers, but 5 are currently not running according to
>>>>> stat. So I guess this isn't a problem unless we lose another one.
>>>>> We have plans to upgrade zookeeper to solve the CPU issue but haven't
>>>>> been able to do that yet.
>>>>>
>>>>> Any help appreciated,
>>>>> Nick Bailey
>>
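For reference, the "cluster options" Patrick links to are the tickTime / initLimit / syncLimit settings in zoo.cfg. A purely illustrative sketch follows (host names and values are made up, not Nick's actual configuration):

# zoo.cfg - illustrative values only
tickTime=2000        # milliseconds per tick
initLimit=10         # ticks a follower may take to connect to and sync with the leader
syncLimit=5          # ticks a follower may fall behind before being dropped
dataDir=/var/zookeeper
clientPort=2181

# one entry per ensemble member: host:peerPort:electionPort
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

With these example values a restarted follower has initLimit * tickTime = 20 seconds to connect to the leader and sync; a large data set or a slow link can exhaust that window, which is why Patrick asks about data size and network latency. Raising initLimit is the usual adjustment in that case.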