Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
Everything is here http://people.apache.org/~jdcryans/zk_election_bug.tar.gz

The server we are trying to start is sv4borg222 (myid is 2) and we
started it around 10:03:21

Thx!

J-D

On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt ph...@apache.org wrote:
 1) Capture the logs from all 5 servers
 2) give the config for the down server; also indicate what its server id
 is.
 3) if possible it would be interesting to see the netstat information from 2
 of the servers - the one that's down and one or more of the others.
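
 For reference, a minimal way to gather 2) and 3) on each server might look
 like the following (paths are placeholders and depend on your installation):

    cat conf/zoo.cfg
    cat /var/zookeeper/myid              # the server id, for 2)
    netstat -an | egrep ':(2181|2888|3888)'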

 Patrick

 Jean-Daniel Cryans wrote:

 I believe we've just hit the same problem with zk-3.2.1

 For some reason a machine crashed and it was part of our quorum of 5
 servers. When we try to restart it, it does this (I replaced the
 hostname and IP):

 2010-01-25 10:25:06,469 WARN
 org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
 channel to 1 at election address somehost1/someip1:3888
 java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at
 org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
        at
 org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
        at
 org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
        at
 org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

 It has been like that for almost 20 minutes now, trying every other
 server in the quorum on different channels. ruok says imok but all
 other commands say that the ZK server isn't running. I don't believe that
 3.2.2 will help unless ZK-547 does more than it seems to.

 Anything else I should look at?

 Thx!

 J-D

 On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:

 So the solution for us was to just nuke zookeeper and restart everywhere.
  We will also be upgrading soon as well.

 To answer your question, yes I believe all the servers were running
 normally
 except for the fact that they were experiencing high CPU usage.  As we
 began
 to see some CPU alerts I started restarting some of the servers.

 It was then that we noticed that they were not actually running according
 to
 'stat'.

 I still have the log from one server with a debug level and the rest with
 a
 warn level. If you would like to see any of these and analyze them just
 let
 me know.

 Thanks for the help,
 Nick Bailey




Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
According to the log for 222 it can't open a connection to the election 
port (3888) for any of the other servers. This seems very unusual. Can 
you verify that there's connectivity on that port between 222 and all the 
other servers?


Also, can you re-run netstat with the -a option? We can see the listen 
sockets that way (omitted by netstat by default). It would be great if 
you could send the netstat for all 5 servers.


Thanks,

Patrick

Jean-Daniel Cryans wrote:

Everything is here http://people.apache.org/~jdcryans/zk_election_bug.tar.gz

The server we are trying to start is sv4borg222 (myid is 2) and we
started it around 10:03:21

Thx!

J-D

On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt ph...@apache.org wrote:

1) Capture the logs from all 5 servers
2) give the config for the down server; also indicate what its server id
is.
3) if possible it would be interesting to see the netstat information from 2
of the servers - the one that's down and one or more of the others.

Patrick

Jean-Daniel Cryans wrote:

I believe we've just hit the same problem with zk-3.2.1

For some reason a machine crashed and it was part of our quorum of 5
servers. When we try to restart it, it does this (I replaced the
hostname and IP):

2010-01-25 10:25:06,469 WARN
org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
channel to 1 at election address somehost1/someip1:3888
java.net.ConnectException: Connection refused
   at sun.nio.ch.Net.connect(Native Method)
   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
   at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
   at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
   at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
   at
org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
   at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

It has been like that for almost 20 minutes now, trying every other
server in the quorum on different channels. ruok says imok but all
other commands say that the ZK server isn't running. I don't believe that
3.2.2 will help unless ZK-547 does more than it seems to.

Anything else I should look at?

Thx!

J-D

On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:

So the solution for us was to just nuke zookeeper and restart everywhere.
 We will also be upgrading soon as well.

To answer your question, yes I believe all the servers were running
normally
except for the fact that they were experiencing high CPU usage.  As we
began
to see some CPU alerts I started restarting some of the servers.

It was then that we noticed that they were not actually running according
to
'stat'.

I still have the log from one server with a debug level and the rest with
a
warn level. If you would like to see any of these and analyze them just
let
me know.

Thanks for the help,
Nick Bailey



Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
 According to the log for 222 it can't open a connection to the election port
 (3888) for any of the other servers. This seems very unusual. Can you verify
 that there's connectivity on that port between 222 and all the other servers?

jdcry...@sv4borg222:~$ telnet sv4borg224 3888
Trying 10.10.20.224...
telnet: Unable to connect to remote host: Connection refused
jdcry...@sv4borg222:~$ telnet sv4borg224 2888
Trying 10.10.20.224...
Connected to sv4borg224.
Escape character is '^]'.

 Also, can you re-run netstat with the -a option? We can see the listen
 sockets that way (omitted by netstat by default). It would be great if you
 could send the netstat for all 5 servers.

I updated the tar.gz with the 5 netstat -anp

Thx!

J-D


 Thanks,

 Patrick

 Jean-Daniel Cryans wrote:

 Everything is here
 http://people.apache.org/~jdcryans/zk_election_bug.tar.gz

 The server we are trying to start is sv4borg222 (myid is 2) and we
 started it around 10:03:21

 Thx!

 J-D



Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
JD, there's something _very_ unusual in your setup. Are you running 
officially released ZooKeeper code or something else?


Either there is a misconfiguration on the other servers (the configs for 
the other servers are exactly the same as 222's, right?), or perhaps some 
patches to the ZK codebase went awry?


See the attached file zk_ports.txt. This is a summary of the netstat 
-a that you sent. Notice in particular that UDP sockets are open for 
port 2888! This should not happen in the default ZK configuration case.


By default we only use TCP connections between servers (quorum and 
election). There is an electionAlg option that allows users to turn off 
the TCP-based fast leader election and go with a UDP-based one, but I don't 
see that in the config you provided for 222. (As I said, assuming you 
are not setting this option on the other servers either, correct?)
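
For reference, a typical zoo.cfg for a 5-server ensemble using the default 
TCP-based fast leader election looks roughly like the sketch below. The 
hostnames, ports and paths here are placeholders, and electionAlg normally 
does not need to be set, since 3 (TCP-based FLE) is the default:

   tickTime=2000
   initLimit=10
   syncLimit=5
   dataDir=/var/zookeeper
   clientPort=2181
   # electionAlg=3    (the default; 0 selects the old UDP-based election)
   server.1=zk1:2888:3888
   server.2=zk2:2888:3888
   server.3=zk3:2888:3888
   server.4=zk4:2888:3888
   server.5=zk5:2888:3888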



Mahadev and I do remember that there was a bug in the 3.2 branch, prior 
to 3.2 ever being released, that caused us to use non-FLE (so UDP-based) 
election by default; however, we fixed that before 3.2.0 ever shipped (it 
was a bug in our config processing code) and it was never exposed in an 
official release. Perhaps you have picked up some code prior to that?


Patrick

Jean-Daniel Cryans wrote:

According to the log for 222 it can't open a connection to the election port
(3888) for any of the other servers. This seems very unusual. Can you verify
that there's connectivity on that port between 222 and all the other servers?


jdcry...@sv4borg222:~$ telnet sv4borg224 3888
Trying 10.10.20.224...
telnet: Unable to connect to remote host: Connection refused
jdcry...@sv4borg222:~$ telnet sv4borg224 2888
Trying 10.10.20.224...
Connected to sv4borg224.
Escape character is '^]'.


Also, can you re-run netstat with the -a option? We can see the listen
sockets that way (omitted by netstat by default). It would be great if you
could send the netstat for all 5 servers.


I updated the tar.gz with the 5 netstat -anp

Thx!

J-D


Thanks,

Patrick

Jean-Daniel Cryans wrote:

Everything is here
http://people.apache.org/~jdcryans/zk_election_bug.tar.gz

The server we are trying to start is sv4borg222 (myid is 2) and we
started it around 10:03:21

Thx!

J-D

tcp6   0  0 10.10.20.221:34865  10.10.20.224:2888   ESTABLISHED 
14682/java  
udp6   0  0 :::2888 :::*
14682/java  


tcp6   0  0 :::3888 :::*LISTEN  
4092/java   
unix  2  [ ] STREAM CONNECTED 721588877 7642/java   


tcp6   0  0 10.10.20.223:42518  10.10.20.224:2888   ESTABLISHED 
2704/java   
udp6   0  0 :::2888 :::*
2704/java   


tcp6   0  0 :::2888 :::*LISTEN  
31052/java  
tcp6   0  0 10.10.20.224:2888   10.10.20.223:42518  ESTABLISHED 
31052/java  
tcp6   0  0 10.10.20.224:2888   10.10.20.225:51459  ESTABLISHED 
31052/java  
tcp6   0  0 10.10.20.224:2888   10.10.20.221:34865  ESTABLISHED 
31052/java  
udp6   0  0 :::2888 :::*
31052/java  


tcp6   0  0 10.10.20.225:51459  10.10.20.224:2888   ESTABLISHED 
19545/java  
udp6   0  0 :::2888 :::*
19545/java  
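
For reference, a per-port summary like the zk_ports.txt excerpt above can be 
pulled out of the saved netstat output with something along these lines (the 
file names are illustrative):

   egrep ':(2888|3888)' netstat-sv4borg22*.txt

which makes the unexpected udp6 entries on 2888 stand out next to the normal 
tcp6 LISTEN/ESTABLISHED lines.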


Re: Killing a zookeeper server

2010-01-13 Thread Flavio Junqueira
Hi Nick,

Your assessment sounds correct; the issue seems to be caused by the bug 
described in ZOOKEEPER-427. Can't you upgrade to a newer release? Killing 
the leader should do it, but the bug will still be there, so I recommend 
upgrading.


Thanks,
-Flavio

On Jan 12, 2010, at 10:52 PM, Nick Bailey wrote:


We are running zookeeper 3.1.0

Recently we noticed the cpu usage on our machines becoming  
increasingly high and we believe the cause is


https://issues.apache.org/jira/browse/ZOOKEEPER-427

However our solution when we noticed the problem was to kill the  
zookeeper process and restart it.


After doing that though it looks like the newly restarted zookeeper  
server is continually attempting to elect a leader even though one  
already exists.


The process responds with 'imok' when asked, but the stat command  
returns 'ZooKeeperServer not running'.


I believe that killing the current leader should trigger all servers  
to do an election and solve the problem, but I'm not sure. Should  
that be the course of action in this situation?


Also we have 12 servers, but 5 are currently not running according  
to stat.  So I guess this isn't a problem unless we lose another  
one.  We have plans to upgrade zookeeper to solve the cpu issue but  
haven't been able to do that yet.


Any help appreciated,
Nick Bailey





Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
So the solution for us was to just nuke zookeeper and restart  
everywhere.  We will also be upgrading soon as well.


To answer your question, yes I believe all the servers were running  
normally except for the fact that they were experiencing high CPU  
usage.  As we began to see some CPU alerts I started restarting some  
of the servers.


It was then that we noticed that they were not actually running  
according to 'stat'.


I still have the log from one server with a debug level and the rest  
with a warn level. If you would like to see any of these and analyze  
them just let me know.


Thanks for the help,
Nick Bailey

On Jan 12, 2010, at 8:20 PM, Patrick Hunt ph...@apache.org wrote:


Nick Bailey wrote:
In my last email I failed to include a log line that may be  
relevant as well
2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181]  
(QuorumCnxManager) DEBUG - Queue size: 0
2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181]  
(FastLeaderElection) INFO  - Notification time out: 6400


Yes, that is significant/interesting. I believe this means that  
there is some problem with the election process (ie the server re- 
joining the ensemble). We have a backoff on these attempts, which  
matches your description below. We have fixed some election issues  
in recent versions (we introduced fault injection testing prior to  
the 3.2.1 release which found a few issues with election). I don't  
have them off hand - but I've asked Flavio to comment directly (he's  
in diff tz).


Can you provide a bit more background: prior to this issue, this  
particular server was running fine? You restarted it and then  
started seeing the issue? (rather than this being a new server I  
mean). What I'm getting at is that there shouldn't/couldn't be any  
networking/firewall type issue going on right?


Can you provide a full/more log? What I'd suggest is shut down this  
one server, clear the log4j log file, then restart it. Let the  
problem reproduce, then gzip the log4j log file and attach to your  
response. Ok?
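
A rough sketch of that cycle (the script names and log locations below are 
placeholders and depend on how ZooKeeper was installed):

   bin/zkServer.sh stop
   mv zookeeper.log zookeeper.log.old    # or otherwise clear the log4j log
   bin/zkServer.sh start
   # ... wait for the election problem to reproduce ...
   gzip -c zookeeper.log > zookeeper.log.gz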


Patrick

We see this line occur frequently and the timeout will gradually  
increase to 6. It appears that all of our servers that seem to  
be acting normally are experiencing the cpu issue I mentioned  
earlier 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'.  
Perhaps that is causing the timeout in responding?
Also to answer your other questions Patrick, we aren't storing a  
large amount of data really and network latency appears fine.

Thanks for the help,
Nick
-Original Message-
From: Nick Bailey nicholas.bai...@rackspace.com
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server
12 was just to keep uniformity on our servers. Our clients are  
connecting from the same 12 servers.  Easily modifiable and perhaps  
we should look into changing that.
The logs just seem to indicate that the servers that claim to have  
no server running are continually attempting to elect a leader.  A  
sample is provided below.  The initial exception is something we  
see regularly in our logs and the debug and info lines following  
are simply repeating throughout the log.
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)  
WARN  - Exception causing close of session 0x0 due to  
java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)  
DEBUG - IOException stack trace

java.io.IOException: Read error
   at org.apache.zookeeper.server.NIOServerCnxn.doIO 
(NIOServerCnxn.java:295)
   at org.apache.zookeeper.server.NIOServerCnxn$Factory.run 
(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn)  
INFO  - closing session:0x0 NIOServerCnxn:  
java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181  
remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn)  
DEBUG - ignoring exception during input shutdown

java.net.SocketException: Transport endpoint is not connected
   at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
   at sun.nio.ch.SocketChannelImpl.shutdownInput 
(SocketChannelImpl.java:640)
   at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java: 
360)
   at org.apache.zookeeper.server.NIOServerCnxn.close 
(NIOServerCnxn.java:767)
   at org.apache.zookeeper.server.NIOServerCnxn.doIO 
(NIOServerCnxn.java:421)
   at org.apache.zookeeper.server.NIOServerCnxn$Factory.run 
(NIOServerCnxn.java:162)
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]  
(FastLeaderElection) INFO  - Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]  
(FastLeaderElection) INFO  - Notification: 3, 30064826826, 1, 9,  
LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]  
(FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid:  
30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181]  
(FastLeaderElection) INFO

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
On a related note, it was initially confusing to me that the server
returned 'imok' when it wasn't part of the quorum. I realize the
internal checks are probably in separate areas of the code, but if
others feel similarly I could file an improvement in JIRA.

.. Adam

On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:
 So the solution for us was to just nuke zookeeper and restart everywhere.
  We will also be upgrading soon as well.

 To answer your question, yes I believe all the servers were running normally
 except for the fact that they were experiencing high CPU usage.  As we began
 to see some CPU alerts I started restarting some of the servers.

 It was then that we noticed that they were not actually running according to
 'stat'.

 I still have the log from one server with a debug level and the rest with a
 warn level. If you would like to see any of these and analyze them just let
 me know.

 Thanks for the help,
 Nick Bailey

 On Jan 12, 2010, at 8:20 PM, Patrick Hunt ph...@apache.org wrote:

 Nick Bailey wrote:

 In my last email I failed to include a log line that may be relevant as
 well
 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager)
 DEBUG - Queue size: 0
 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
 INFO  - Notification time out: 6400

 Yes, that is significant/interesting. I believe this means that there is
 some problem with the election process (ie the server re-joining the
 ensemble). We have a backoff on these attempts, which matches your
 description below. We have fixed some election issues in recent versions (we
 introduced fault injection testing prior to the 3.2.1 release which found a
 few issues with election). I don't have them off hand - but I've asked
 Flavio to comment directly (he's in diff tz).

 Can you provide a bit more background: prior to this issue, this
 particular server was running fine? You restarted it and then started seeing
 the issue? (rather than this being a new server I mean). What I'm getting at
 is that there shouldn't/couldn't be any networking/firewall type issue going
 on right?

 Can you provide a full/more log? What I'd suggest is shut down this one
 server, clear the log4j log file, then restart it. Let the problem
 reproduce, then gzip the log4j log file and attach to your response. Ok?

 Patrick

 We see this line occur frequently and the timeout will gradually
 increase to 6. It appears that all of our servers that seem to be acting
 normally are experiencing the cpu issue I mentioned earlier
 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is
 causing the timeout in responding?
 Also to answer your other questions Patrick, we aren't storing a large
 amount of data really and network latency appears fine.
 Thanks for the help,
 Nick
 -Original Message-
 From: Nick Bailey nicholas.bai...@rackspace.com
 Sent: Tuesday, January 12, 2010 6:03pm
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Killing a zookeeper server
 12 was just to keep uniformity on our servers. Our clients are connecting
 from the same 12 servers.  Easily modifiable and perhaps we should look into
 changing that.
 The logs just seem to indicate that the servers that claim to have no
 server running are continually attempting to elect a leader.  A sample is
 provided below.  The initial exception is something we see regularly in our
 logs and the debug and info lines following are simply repeating throughout
 the log.
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN
  - Exception causing close of session 0x0 due to java.io.IOException: Read
 error
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - IOException stack trace
 java.io.IOException: Read error
       at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
       at
 org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO
  - closing session:0x0 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181
 remote=/172.20.36.9:50367]
 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - ignoring exception during input shutdown
 java.net.SocketException: Transport endpoint is not connected
       at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
       at
 sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
       at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
       at
 org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
       at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
       at
 org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
 INFO  - Adding vote
 2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181

Re: Killing a zookeeper server

2010-01-13 Thread Mahadev Konar
Hi Adam,
  That seems fair to file as an improvement. Running 'stat' did return the
right stats, right? Saying the servers weren't able to elect a leader?

mahadev


On 1/13/10 11:52 AM, Adam Rosien a...@rosien.net wrote:

 On a related note, it was initially confusing to me that the server
 returned 'imok' when it wasn't part of the quorum. I realize the
 internal checks are probably in separate areas of the code, but if
 others feel similarly I could file an improvement in JIRA.
 
 .. Adam
 
 On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:
 So the solution for us was to just nuke zookeeper and restart everywhere.
  We will also be upgrading soon as well.
 
 To answer your question, yes I believe all the servers were running normally
 except for the fact that they were experiencing high CPU usage.  As we began
 to see some CPU alerts I started restarting some of the servers.
 
 It was then that we noticed that they were not actually running according to
 'stat'.
 
 I still have the log from one server with a debug level and the rest with a
 warn level. If you would like to see any of these and analyze them just let
 me know.
 
 Thanks for the help,
 Nick Bailey
 
 On Jan 12, 2010, at 8:20 PM, Patrick Hunt ph...@apache.org wrote:
 
 Nick Bailey wrote:
 
 In my last email I failed to include a log line that may be relevant as
 well
 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager)
 DEBUG - Queue size: 0
 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
 INFO  - Notification time out: 6400
 
 Yes, that is significant/interesting. I believe this means that there is
 some problem with the election process (ie the server re-joining the
 ensemble). We have a backoff on these attempts, which matches your
 description below. We have fixed some election issues in recent versions (we
 introduced fault injection testing prior to the 3.2.1 release which found a
 few issues with election). I don't have them off hand - but I've asked
 Flavio to comment directly (he's in diff tz).
 
 Can you provide a bit more background: prior to this issue, this
 particular server was running fine? You restarted it and then started seeing
 the issue? (rather than this being a new server I mean). What I'm getting at
 is that there shouldn't/couldn't be any networking/firewall type issue going
 on right?
 
 Can you provide a full/more log? What I'd suggest is shut down this one
 server, clear the log4j log file, then restart it. Let the problem
 reproduce, then gzip the log4j log file and attach to your response. Ok?
 
 Patrick
 
 We see this line occur frequently and the timeout will gradually
 increase to 6. It appears that all of our servers that seem to be
 acting
 normally are experiencing the cpu issue I mentioned earlier
 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is
 causing the timeout in responding?
 Also to answer your other questions Patrick, we aren't storing a large
 amount of data really and network latency appears fine.
 Thanks for the help,
 Nick
 -Original Message-
 From: Nick Bailey nicholas.bai...@rackspace.com
 Sent: Tuesday, January 12, 2010 6:03pm
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Killing a zookeeper server
 12 was just to keep uniformity on our servers. Our clients are connecting
 from the same 12 servers.  Easily modifiable and perhaps we should look
 into
 changing that.
 The logs just seem to indicate that the servers that claim to have no
 server running are continually attempting to elect a leader.  A sample is
 provided below.  The initial exception is something we see regularly in our
 logs and the debug and info lines following are simply repeating throughout
 the log.
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN
  - Exception causing close of session 0x0 due to java.io.IOException: Read
 error
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - IOException stack trace
 java.io.IOException: Read error
       at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
       at
 org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO
  - closing session:0x0 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181
 remote=/172.20.36.9:50367]
 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - ignoring exception during input shutdown
 java.net.SocketException: Transport endpoint is not connected
       at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
       at
 sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
       at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
       at
 org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
       at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
Yes, leader election either failed or was in progress yet 'ruok'
returned 'imok'.

.. Adam

On Wed, Jan 13, 2010 at 1:08 PM, Mahadev Konar maha...@yahoo-inc.com wrote:
 Hi Adam,
  That seems fair to file as an improvement. Running 'stat' did return the
 right stats, right? Saying the servers weren't able to elect a leader?

 mahadev


 On 1/13/10 11:52 AM, Adam Rosien a...@rosien.net wrote:

 On a related note, it was initially confusing to me that the server
 returned 'imok' when it wasn't part of the quorum. I realize the
 internal checks are probably in separate areas of the code, but if
 others feel similarly I could file an improvement in JIRA.

 .. Adam

 On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:
 So the solution for us was to just nuke zookeeper and restart everywhere.
  We will also be upgrading soon as well.

 To answer your question, yes I believe all the servers were running normally
 except for the fact that they were experiencing high CPU usage.  As we began
 to see some CPU alerts I started restarting some of the servers.

 It was then that we noticed that they were not actually running according to
 'stat'.

 I still have the log from one server with a debug level and the rest with a
 warn level. If you would like to see any of these and analyze them just let
 me know.

 Thanks for the help,
 Nick Bailey

 On Jan 12, 2010, at 8:20 PM, Patrick Hunt ph...@apache.org wrote:

 Nick Bailey wrote:

 In my last email I failed to include a log line that may be relevant as
 well
 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager)
 DEBUG - Queue size: 0
 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
 INFO  - Notification time out: 6400

 Yes, that is significant/interesting. I believe this means that there is
 some problem with the election process (ie the server re-joining the
 ensemble). We have a backoff on these attempts, which matches your
 description below. We have fixed some election issues in recent versions 
 (we
 introduced fault injection testing prior to the 3.2.1 release which found a
 few issues with election). I don't have them off hand - but I've asked
 Flavio to comment directly (he's in diff tz).

 Can you provide a bit more background: prior to this issue, this
 particular server was running fine? You restarted it and then started 
 seeing
 the issue? (rather than this being a new server I mean). What I'm getting 
 at
 is that there shouldn't/couldn't be any networking/firewall type issue 
 going
 on right?

 Can you provide a full/more log? What I'd suggest is shut down this one
 server, clear the log4j log file, then restart it. Let the problem
 reproduce, then gzip the log4j log file and attach to your response. Ok?

 Patrick

 We see this line occur frequently and the timeout will gradually
 increase to 6. It appears that all of our servers that seem to be
 acting
 normally are experiencing the cpu issue I mentioned earlier
 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is
 causing the timeout in responding?
 Also to answer your other questions Patrick, we aren't storing a large
 amount of data really and network latency appears fine.
 Thanks for the help,
 Nick
 -Original Message-
 From: Nick Bailey nicholas.bai...@rackspace.com
 Sent: Tuesday, January 12, 2010 6:03pm
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Killing a zookeeper server
 12 was just to keep uniformity on our servers. Our clients are connecting
 from the same 12 servers.  Easily modifiable and perhaps we should look
 into
 changing that.
 The logs just seem to indicate that the servers that claim to have no
 server running are continually attempting to elect a leader.  A sample is
 provided below.  The initial exception is something we see regularly in 
 our
 logs and the debug and info lines following are simply repeating 
 throughout
 the log.
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN
  - Exception causing close of session 0x0 due to java.io.IOException: Read
 error
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - IOException stack trace
 java.io.IOException: Read error
       at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
       at
  org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
 2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO
  - closing session:0x0 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181
 remote=/172.20.36.9:50367]
 2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG
 - ignoring exception during input shutdown
 java.net.SocketException: Transport endpoint is not connected
       at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
       at
 sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
       at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java

Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
Additionally, in this case the stat command simply stated that the zookeeper 
server was not running, which is obviously in conflict with an 'imok' response. 
 In order to determine that the server was attempting to rejoin/elect a leader 
I needed to enable debug logging, restart the process, and check the log.

So from a user perspective some more info from the stat command in this case 
would be very helpful.

-Nick


On Wednesday, January 13, 2010 4:39pm, Mahadev Konar maha...@yahoo-inc.com 
said:

 Hi Adam,
  The 'ruok' command at the server does not have many checks in place (in fact
 it does not have any checks in place). It just gives you an idea that the
 server is really running. The 'stat' command is meant to analyze the
 server's state.
 
 
 So you should in fact file an improvement wherein there are at least some
 checks in place for the 'ruok' command to return 'imok', so that users know
 their servers are running healthy.
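
 For example (assuming the client port is 2181 and netcat is available), the
 two commands can be compared with:

    echo ruok | nc localhost 2181     # prints imok if the process is up
    echo stat | nc localhost 2181     # prints server details, or an error
                                      # such as "ZooKeeperServer not running"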
 
 
 Thanks
 mahadev
 
 
 On 1/13/10 1:29 PM, Adam Rosien a...@rosien.net wrote:
 
 Yes, leader election either failed or was in progress yet 'ruok'
 returned 'imok'.

 .. Adam

 On Wed, Jan 13, 2010 at 1:08 PM, Mahadev Konar maha...@yahoo-inc.com wrote:
 Hi Adam,
  That seems fair to file as an improvement. Running 'stat' did return the
  right stats, right? Saying the servers weren't able to elect a leader?

 mahadev


 On 1/13/10 11:52 AM, Adam Rosien a...@rosien.net wrote:

 On a related note, it was initially confusing to me that the server
 returned 'imok' when it wasn't part of the quorum. I realize the
 internal checks are probably in separate areas of the code, but if
 others feel similarly I could file an improvement in JIRA.

 .. Adam

 On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey ni...@mailtrust.com wrote:
 So the solution for us was to just nuke zookeeper and restart everywhere.
  We will also be upgrading soon as well.

 To answer your question, yes I believe all the servers were running
 normally
 except for the fact that they were experiencing high CPU usage.  As we
 began
 to see some CPU alerts I started restarting some of the servers.

 It was then that we noticed that they were not actually running according
 to
 'stat'.

 I still have the log from one server with a debug level and the rest with 
 a
 warn level. If you would like to see any of these and analyze them just 
 let
 me know.

 Thanks for the help,
 Nick Bailey

 On Jan 12, 2010, at 8:20 PM, Patrick Hunt ph...@apache.org wrote:

 Nick Bailey wrote:

 In my last email I failed to include a log line that may be relevant as
 well
 2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager)
 DEBUG - Queue size: 0
 2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection)
 INFO  - Notification time out: 6400

 Yes, that is significant/interesting. I believe this means that there is
 some problem with the election process (ie the server re-joining the
 ensemble). We have a backoff on these attempts, which matches your
 description below. We have fixed some election issues in recent versions
 (we
 introduced fault injection testing prior to the 3.2.1 release which found
 a
 few issues with election). I don't have them off hand - but I've asked
 Flavio to comment directly (he's in diff tz).

 Can you provide a bit more background: prior to this issue, this
 particular server was running fine? You restarted it and then started
 seeing
 the issue? (rather than this being a new server I mean). What I'm getting
 at
 is that there shouldn't/couldn't be any networking/firewall type issue
 going
 on right?

 Can you provide a full/more log? What I'd suggest is shut down this one
 server, clear the log4j log file, then restart it. Let the problem
 reproduce, then gzip the log4j log file and attach to your response. Ok?

 Patrick

 We see this line occur frequently and the timeout will gradually
 increase to 6. It appears that all of our servers that seem to be
 acting
 normally are experiencing the cpu issue I mentioned earlier
 'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is
 causing the timeout in responding?
 Also to answer your other questions Patrick, we aren't storing a large
 amount of data really and network latency appears fine.
 Thanks for the help,
 Nick
 -Original Message-
 From: Nick Bailey nicholas.bai...@rackspace.com
 Sent: Tuesday, January 12, 2010 6:03pm
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: Killing a zookeeper server
 12 was just to keep uniformity on our servers. Our clients are 
 connecting
 from the same 12 servers.  Easily modifiable and perhaps we should
 look
 into
 changing that.
 The logs just seem to indicate that the servers that claim to have no
 server running are continually attempting to elect a leader.  A sample
 is
 provided below.  The initial exception is something we see regularly
 in
 our
 logs and the debug and info lines following are simply repeating
 throughout
 the log.
 2010-01-12 17:55:02,269

Killing a zookeeper server

2010-01-12 Thread Nick Bailey
We are running zookeeper 3.1.0

Recently we noticed the cpu usage on our machines becoming increasingly high 
and we believe the cause is 

https://issues.apache.org/jira/browse/ZOOKEEPER-427

However our solution when we noticed the problem was to kill the zookeeper 
process and restart it.

After doing that though it looks like the newly restarted zookeeper server is 
continually attempting to elect a leader even though one already exists.

The process responds with 'imok' when asked, but the stat command returns 
'ZooKeeperServer not running'.

I believe that killing the current leader should trigger all servers to do an 
election and solve the problem, but I'm not sure. Should that be the course of 
action in this situation?

Also we have 12 servers, but 5 are currently not running according to stat.  So 
I guess this isn't a problem unless we lose another one.  We have plans to 
upgrade zookeeper to solve the cpu issue but haven't been able to do that yet.

Any help appreciated,
Nick Bailey



Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt
12 servers? That's a lot; if you don't mind my asking, why so many? 
Typically we recommend 5 - that way you can have one down for 
maintenance and still have a failure that doesn't bring down the cluster.


The leader election is probably the restarted machine attempting to 
re-join the ensemble (it should join as a follower if you have a leader 
already elected, given that its xid is behind the existing leader.) 
Hard to tell though without the logs.


You might also be seeing the initLimit exceeded; is the data you are 
storing in ZK large? Or perhaps network connectivity is slow?

http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
again the logs would give some insight on this.


Patrick

Nick Bailey wrote:

We are running zookeeper 3.1.0

Recently we noticed the cpu usage on our machines becoming
increasingly high and we believe the cause is

https://issues.apache.org/jira/browse/ZOOKEEPER-427

However our solution when we noticed the problem was to kill the
zookeeper process and restart it.

After doing that though it looks like the newly restarted zookeeper
server is continually attempting to elect a leader even though one
already exists.

The process responds with 'imok' when asked, but the stat command
returns 'ZooKeeperServer not running'.

I believe that killing the current leader should trigger all servers
to do an election and solve the problem, but I'm not sure. Should
that be the course of action in this situation?

Also we have 12 servers, but 5 are currently not running according to
stat.  So I guess this isn't a problem unless we lose another one.
We have plans to upgrade zookeeper to solve the cpu issue but haven't
been able to do that yet.

Any help appreciated, Nick Bailey



Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
I have a related question: what's the behavior of a cluster of 3 when
one is down? I've tried it and a leader is elected, but are there any
other caveats for this situation?

.. Adam

On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt ph...@apache.org wrote:
 12 servers? That's a lot; if you don't mind my asking, why so many? Typically
 we recommend 5 - that way you can have one down for maintenance and still
 have a failure that doesn't bring down the cluster.

 The electing a leader is probably the restarted machine attempting to
 re-join the ensemble (it should join as a follower if you have a leader
 already elected, given that its xid is behind the existing leader.) Hard to
 tell though without the logs.

 You might also be seeing the initLimit exceeded, is the data you are storing
 in ZK large? Or perhaps network connectivity is slow?
 http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
 again the logs would give some insight on this.


 Patrick

 Nick Bailey wrote:

 We are running zookeeper 3.1.0

 Recently we noticed the cpu usage on our machines becoming
 increasingly high and we believe the cause is

 https://issues.apache.org/jira/browse/ZOOKEEPER-427

 However our solution when we noticed the problem was to kill the
 zookeeper process and restart it.

 After doing that though it looks like the newly restarted zookeeper
 server is continually attempting to elect a leader even though one
 already exists.

 The process responds with 'imok' when asked, but the stat command
 returns 'ZooKeeperServer not running'.

 I believe that killing the current leader should trigger all servers
 to do an election and solve the problem, but I'm not sure. Should
 that be the course of action in this situation?

 Also we have 12 servers, but 5 are currently not running according to
 stat.  So I guess this isn't a problem unless we lose another one.
 We have plans to upgrade zookeeper to solve the cpu issue but haven't
 been able to do that yet.

 Any help appreciated, Nick Bailey




Re: Killing a zookeeper server

2010-01-12 Thread Nick Bailey
In my last email I failed to include a log line that may be relevant as well

2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - 
Queue size: 0
2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification time out: 6400

We see this line occur frequently and the timeout will gradually increase to 
6. It appears that all of our servers that seem to be acting normally are 
experiencing the cpu issue I mentioned earlier 
'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is causing 
the timeout in responding?

Also to answer your other questions Patrick, we aren't storing a large amount 
of data really and network latency appears fine.

Thanks for the help,
Nick


-Original Message-
From: Nick Bailey nicholas.bai...@rackspace.com
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server

12 was just to keep uniformity on our servers. Our clients are connecting from 
the same 12 servers.  Easily modifiable and perhaps we should look into 
changing that.


The logs just seem to indicate that the servers that claim to have no server 
running are continually attempting to elect a leader.  A sample is provided 
below.  The initial exception is something we see regularly in our logs and the 
debug and info lines following are simply repeating throughout the log.


2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN  - 
Exception causing close of session 0x0 due to java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - 
IOException stack trace
java.io.IOException: Read error
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
at 
org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO  - 
closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected 
local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - 
ignoring exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at 
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
at 
org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
at 
org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)

2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:56:52,190 [WorkerReceiver Thread] 
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,190 [WorkerReceiver Thread] 
(FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote

Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
Doh - that makes total sense. For whatever reason I thought with 2
servers you couldn't get a majority :P

On Tue, Jan 12, 2010 at 3:17 PM, Henry Robinson he...@cloudera.com wrote:
 Hi Adam -

 As long as a quorum of servers is running, ZK will be live. With majority
 quorums, 2 of 3 is enough to keep going. In general, if fewer than half your
 nodes have failed, ZK will keep on keeping on.
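
To make the arithmetic concrete, a minimal sketch (illustrative only, not ZooKeeper's own code) of the majority rule being described:

public class QuorumMath {
    // Smallest number of servers that forms a majority of an N-server ensemble.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // The ensemble stays live while at least a majority of servers is up.
    static boolean isLive(int ensembleSize, int serversUp) {
        return serversUp >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        System.out.println(isLive(3, 2));  // true  - 2 of 3 is a majority
        System.out.println(isLive(3, 1));  // false - one more failure kills a 3-node cluster
        System.out.println(isLive(5, 3));  // true  - a 5-node ensemble tolerates 2 failures
    }
}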

  The main concern with a cluster running on only 2 of its 3 machines is that a
  single further failure will bring down the whole cluster.

 Henry

 2010/1/12 Adam Rosien a...@rosien.net

 I have a related question: what's the behavior of a cluster of 3 when
 one is down? I've tried it and a leader is elected, but are there any
 other caveats for this situation?

 .. Adam

 On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt ph...@apache.org wrote:
   12 servers? That's a lot - if you don't mind my asking, why so many?
   Typically we recommend 5; that way you can have one down for maintenance and
   still have a failure that doesn't bring down the cluster.
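
For reference, a 5-server ensemble of the kind recommended here is driven by a zoo.cfg along the following lines on each node, plus a myid file holding that node's id; the hostnames, paths, and timing values below are placeholders rather than anything from this thread:

# illustrative zoo.cfg for a 5-node ensemble
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888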
 
   The electing a leader is probably the restarted machine attempting to
   re-join the ensemble (it should join as a follower if you have a leader
   already elected, given that its zxid is behind the existing leader's).
   Hard to tell, though, without the logs.
 
   You might also be seeing the initLimit exceeded - is the data you are
   storing in ZK large? Or perhaps network connectivity is slow?
 
 http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
  again the logs would give some insight on this.
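
On initLimit specifically: it is measured in ticks, so a follower gets initLimit * tickTime milliseconds to connect to the leader and sync the full data tree. With the illustrative values from the config sketch above:

tickTime=2000    # 2 s per tick
initLimit=10     # 10 ticks -> 10 * 2000 ms = 20 s to connect and sync;
                 # a large snapshot or a slow link may need a larger initLimit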
 
 
  Patrick
 
  Nick Bailey wrote:
 
  We are running zookeeper 3.1.0
 
  Recently we noticed the cpu usage on our machines becoming
  increasingly high and we believe the cause is
 
  https://issues.apache.org/jira/browse/ZOOKEEPER-427
 
  However our solution when we noticed the problem was to kill the
  zookeeper process and restart it.
 
  After doing that though it looks like the newly restarted zookeeper
  server is continually attempting to elect a leader even though one
  already exists.
 
   The process responds with 'imok' when asked, but the stat command
   returns 'ZooKeeperServer not running'.
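
The 'imok' and stat checks are ZooKeeper's four-letter-word commands, sent as plain text to the client port. A minimal, illustrative probe (hostname and port below are placeholders) might look like:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class FourLetterWord {
    // Send a four-letter word (e.g. "ruok" or "stat") and return the raw reply.
    static String send(String host, int port, String cmd) throws Exception {
        try (Socket sock = new Socket(host, port)) {
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.UTF_8));
            out.flush();
            sock.shutdownOutput();              // signal end of request
            InputStream in = sock.getInputStream();
            StringBuilder reply = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                reply.append((char) b);
            }
            return reply.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(send("localhost", 2181, "ruok"));  // "imok" if the process is up
        System.out.println(send("localhost", 2181, "stat"));  // stats, or an error message
    }
}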
 
   I believe that killing the current leader should trigger all servers
   to do an election and solve the problem, but I'm not sure. Should
   that be the course of action in this situation?
 
  Also we have 12 servers, but 5 are currently not running according to
  stat.  So I guess this isn't a problem unless we lose another one.
  We have plans to upgrade zookeeper to solve the cpu issue but haven't
  been able to do that yet.
 
  Any help appreciated, Nick Bailey
 
 




Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt

Nick Bailey wrote:

In my last email I failed to include a log line that may be relevant as well:

2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - 
Queue size: 0
2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification time out: 6400


Yes, that is significant/interesting. I believe this means that there is 
some problem with the election process (i.e. the server re-joining the 
ensemble). We have a backoff on these attempts, which matches your 
description below. We have fixed some election issues in recent versions 
(we introduced fault injection testing prior to the 3.2.1 release, which 
found a few issues with election). I don't have them offhand - but I've 
asked Flavio to comment directly (he's in a different time zone).
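
That backoff shows up as the growing 'Notification time out' value in the log (6400 above, and Nick mentions it keeps climbing): each election round that ends without agreement roughly doubles the wait before the next round, up to a cap. A minimal sketch of that pattern - the starting value and the cap below are assumptions, not ZooKeeper's exact constants:

public class ElectionBackoff {
    static final long START_MS = 200;     // assumed initial notification timeout
    static final long MAX_MS   = 60_000;  // assumed upper bound on the timeout

    public static void main(String[] args) {
        long timeout = START_MS;
        for (int round = 1; round <= 10; round++) {
            System.out.println("round " + round + ": wait up to " + timeout + " ms for notifications");
            timeout = Math.min(timeout * 2, MAX_MS);   // double on every round that times out
        }
    }
}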


Can you provide a bit more background: prior to this issue, this 
particular server was running fine? You restarted it and then started 
seeing the issue? (rather than this being a new server I mean). What I'm 
getting at is that there shouldn't/couldn't be any networking/firewall 
type issue going on right?


Can you provide a full/more complete log? What I'd suggest is: shut down this one 
server, clear the log4j log file, then restart it. Let the problem 
reproduce, then gzip the log4j log file and attach it to your response. Ok?


Patrick



We see this line occur frequently and the timeout will gradually increase to 
6. It appears that all of our servers that seem to be acting normally are 
experiencing the CPU issue I mentioned earlier 
'https://issues.apache.org/jira/browse/ZOOKEEPER-427'. Perhaps that is causing 
the timeout in responding?

Also to answer your other questions Patrick, we aren't storing a large amount 
of data really and network latency appears fine.

Thanks for the help,
Nick


-Original Message-
From: Nick Bailey nicholas.bai...@rackspace.com
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server

12 was just to keep uniformity on our servers. Our clients are connecting from 
the same 12 servers.  Easily modifiable and perhaps we should look into 
changing that.


The logs just seem to indicate that the servers that claim to have no server 
running are continually attempting to elect a leader.  A sample is provided 
below.  The initial exception is something we see regularly in our logs and the 
debug and info lines following are simply repeating throughout the log.


2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN  - 
Exception causing close of session 0x0 due to java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - 
IOException stack trace
java.io.IOException: Read error
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
at 
org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO  - 
closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected 
local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - 
ignoring exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at 
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
at 
org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
at 
org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)

2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Adding vote
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO  - 
Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - 
id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181