Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
1) Capture the logs from all 5 servers. 2) Give the config for the down server; also indicate what its server id is. 3) If possible it would be interesting to see the netstat information from two of the servers - the one that's down and one or more of the others. Patrick
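
For step 3, a minimal sketch of what to capture (assuming the stock ZooKeeper ports - 2181 client, 2888 quorum, 3888 election - and Linux netstat; flags vary slightly by platform):

    $ # on each server, list TCP connections touching the ZK ports
    $ netstat -tn | egrep '2181|2888|3888' > netstat-$(hostname).txt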

Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
Everything is here: http://people.apache.org/~jdcryans/zk_election_bug.tar.gz - the server we are trying to start is sv4borg222 (myid is 2) and we started it around 10:03:21. Thx! J-D

Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
According to the log for 222 it can't open a connection to the election port (3888) for any of the other servers. This seems very unusual. Can you verify that there's connectivity on that port between 222 and all the other servers? Also, can you re-run the netstat with the -a option? We can see the ...
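
The -a run would look something like this (default election port assumed); the LISTEN entries are the interesting part, since they show whether the election port is bound at all:

    $ # -a includes listening sockets, not just established connections
    $ netstat -an | grep 3888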

Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
According to the log for 222 it can't open a connection to the election port (3888) for any of the other servers. This seems very unusual. Can you verify that there's connectivity on that port between 222 and all the other servers?
jdcry...@sv4borg222:~$ telnet sv4borg224 3888
Trying ...
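
A quicker way to sweep all the peers from 222 than telnetting one by one - a sketch, where the peer hostnames are placeholders and nc option spellings vary between netcat flavors:

    $ for h in peer1 peer2 peer3 peer4; do
    >     nc -z -w 2 $h 3888 && echo "$h: reachable" || echo "$h: NOT reachable"
    > done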

Re: Killing a zookeeper server

2010-01-25 Thread Patrick Hunt
JD, there's something _very_ unusual in your setup. Are you running official released ZooKeeper code or something else? Either there is a misconfiguration on the other servers (the configs for the other servers are exactly the same as 222, right?), or perhaps some patches to the ZK codebase that ...
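
One quick way to confirm the configs really do match, as a sketch (the zoo.cfg path is an assumption; substitute the real one and list all five hosts):

    $ # identical checksums mean identical configs; myid is the only file that should differ
    $ for h in sv4borg222 sv4borg224; do ssh $h md5sum /etc/zookeeper/zoo.cfg; done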

Re: Killing a zookeeper server

2010-01-25 Thread Jean-Daniel Cryans
Oh my god! You are right, we run an old dev version of 3.2.0: zookeeper-r785019-hbase-1329.jar. This was what we shipped HBase trunk with last summer... This quorum has an uptime of more than 6 months! Well, I guess that explains it; I thought we had restarted it since then during our HBase upgrades ...
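
Worth noting: you can ask a live quorum member what version it is actually running, without restarting anything, via the 'stat' four-letter command on the client port (2181 assumed here):

    $ echo stat | nc sv4borg222 2181 | head -1
    Zookeeper version: ...   # reports the build the running process was started from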

Re: Killing a zookeeper server

2010-01-25 Thread Ted Dunning
I love it that going 6 months to first anomaly is how you can tell you have a *broken* version of ZK. I wish that other software were so broken.

Re: Killing a zookeeper server

2010-01-14 Thread Patrick Hunt
... (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message. ...

Re: Killing a zookeeper server

2010-01-14 Thread Patrick Hunt
Nick Bailey wrote: 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily ...

Re: Killing a zookeeper server

2010-01-13 Thread Flavio Junqueira
Hi Nick, your assessment sounds correct; the issue seems to be caused by the bug described in ZOOKEEPER-427. Can't you upgrade to a newer release? Killing the leader should do it, but the bug will still be there, so I recommend upgrading. Thanks, -Flavio
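
For the kill-the-leader workaround, the 'stat' four-letter command tells you which server currently holds the leader role - a sketch, with placeholder hostnames and the default client port:

    $ for h in host1 host2 host3; do
    >     echo -n "$h: "; echo stat | nc $h 2181 | grep Mode
    > done
    host1: Mode: follower
    host2: Mode: leader
    ...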

Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
Patrick Hunt wrote: 12 servers? That's a lot, if you don't mind my asking why so many? Typically we recommend 5 - that way you can have one down for maintenance and still have a failure that doesn't bring down the cluster. The electing ...

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
Nick Bailey wrote: 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look into changing that. The logs just seem to indicate that the servers that claim to have no server running ...

Re: Killing a zookeeper server

2010-01-13 Thread Mahadev Konar
Nick Bailey wrote: ... a large amount of data really and network latency appears fine. Thanks for the help, Nick

Re: Killing a zookeeper server

2010-01-13 Thread Adam Rosien
Nick Bailey wrote: 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look ...

Re: Killing a zookeeper server

2010-01-13 Thread Nick Bailey
... a large amount of data really and network latency appears fine. Thanks for the help, Nick

Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt
12 servers? That's a lot, if you don't mind my asking why so many? Typically we recommend 5 - that way you can have one down for maintenance and still have a failure that doesn't bring down the cluster. The 'electing a leader' you see is probably the restarted machine attempting to re-join the ensemble ...

Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
I have a related question: what's the behavior of a cluster of 3 when one is down? I've tried it and a leader is elected, but are there any other caveats for this situation? .. Adam

Re: Killing a zookeeper server

2010-01-12 Thread Henry Robinson
Hi Adam - As long as a quorum of servers is running, ZK will be live. With majority quorums, 2 of 3 is enough to keep going. In general, if fewer than half your nodes have failed, ZK will keep on keeping on. The main concern with a cluster running 2 of 3 machines is that a single further failure will bring the whole cluster down.
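
The arithmetic behind that, as a quick sketch: a majority quorum needs floor(n/2)+1 servers, so an ensemble of n tolerates n - (floor(n/2)+1) failures:

    $ for n in 3 5 12; do echo "n=$n quorum=$((n/2+1)) tolerates=$((n-n/2-1)) failures"; done
    n=3 quorum=2 tolerates=1 failures
    n=5 quorum=3 tolerates=2 failures
    n=12 quorum=7 tolerates=5 failures

Which is also why even ensemble sizes buy nothing: 12 servers tolerate 5 failures, the same as 11.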

Re: Killing a zookeeper server

2010-01-12 Thread Nick Bailey
12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look into changing that. The logs just seem to indicate that the servers that claim to have no server running ...
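
On that point: clients don't need a ZK server on their own host - the client connect string can name any subset of the ensemble, and the client picks one server from the list and fails over as needed. With the stock CLI, for example (hostnames hypothetical):

    $ bin/zkCli.sh -server host1:2181,host2:2181,host3:2181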

Re: Killing a zookeeper server

2010-01-12 Thread Adam Rosien
Doh - that makes total sense. For whatever reason I thought with 2 servers you couldn't get a majority :P

Re: Killing a zookeeper server

2010-01-12 Thread Patrick Hunt
Nick Bailey wrote: 12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers ...