I would like to add some info on this.
This may not be very important, but there are subtle differences.
There are two cases:
1. server hardware failure or kernel panic
2. ZooKeeper Java daemon process down
In the former case, the timeout is based on the timeout argument that the
client supplies, and partially on the ZK heartbeat algorithm. The client
recognizes that the server is down after 2/3 of the timeout, then retries at
every timeout. For example, if the timeout is 9000 msec, it first times out
after 6 seconds, and then retries every 9 seconds.
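To make the arithmetic concrete, here is a small sketch of that timeline. It
only models the 2/3 rule and the retry interval as described above; it is not
ZooKeeper client code, and the function name is made up for illustration.

```python
def failure_timeline_ms(timeout_ms, retries):
    """Elapsed time (ms) of the first timeout and of each later retry.

    Models the behaviour described in this mail: the first timeout fires
    after 2/3 of the timeout, and each retry follows one full timeout later.
    """
    first = timeout_ms * 2 // 3
    return [first + i * timeout_ms for i in range(retries + 1)]


print(failure_timeline_ms(9000, 3))  # [6000, 15000, 24000, 33000]
```

With a 9000 msec timeout the first entry is the 6 second detection, and the
later entries are 9 seconds apart.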
In the latter case (Java process down), the socket connect immediately returns
a refused connection, so the client can retry immediately.
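The difference between the two cases can be seen with a plain TCP connect. The
sketch below is only an illustration with a hypothetical helper name, not what
the ZK client does internally: a host that is up but has no listener answers
with "refused" right away, while a dead host usually makes the connect wait
until the timeout.

```python
import socket


def probe(host, port, timeout_s=2.0):
    """Classify one TCP connect attempt.

    Returns 'connected', 'refused', 'timeout', or 'error' for any other
    socket-level failure (for example, no route to host).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # Host is alive but nothing listens on the port: fails immediately.
        return "refused"
    except socket.timeout:
        # No answer at all within timeout_s, as with a dead machine.
        return "timeout"
    except OSError:
        return "error"
```

For example, probe("127.0.0.1", 2181) on a box where the ZooKeeper process is
stopped should normally come back as "refused" without waiting.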
On top of that,
- Hardware load balancer:
If an ensemble is served through a hardware load balancer, the ZooKeeper
client will retry every 2 seconds, since there is only one IP to try.
- DNS RR:
Make sure that "nscd" on your Linux box is off, since the DNS cache will most
likely return the same IP many times.
This is actually worse than the load balancer case, since the ZK client will
retry the same dead server every 2 seconds for some time.
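One rough way to check what the local resolver hands out for a round-robin
name is to list the distinct addresses it returns. This is just a sketch using
the standard socket module; the helper name is made up, and 2181 is only used
here because it is the usual ZooKeeper client port.

```python
import socket


def resolved_addresses(hostname, port=2181):
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

If a name that should map to several servers keeps coming back with a single
address, a cache between the client and DNS is a likely suspect.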
I think it is best not to use a load balancer for ZK clients, since ZK clients
will try the next server immediately if the previous one fails for some reason
(based on the timeout above). This is especially true if your cluster works in
a pseudo-realtime environment where tickTime is set very low.
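As a simplified picture of why an explicit server list helps, the sketch below
walks a list of servers and moves on to the next one as soon as a connect
fails. It is an assumption-laden model, not the real client logic: the function
name and the 2 second connect timeout are chosen for illustration only.

```python
import random
import socket


def connect_first_available(servers, connect_timeout_s=2.0, shuffle=True):
    """Try each (host, port) in turn and return the first that connects.

    Returns ((host, port), socket) on success, or (None, None) if every
    server in the list failed.
    """
    candidates = list(servers)
    if shuffle:
        # Randomize the order so clients spread across the servers.
        random.shuffle(candidates)
    for host, port in candidates:
        try:
            conn = socket.create_connection((host, port), timeout=connect_timeout_s)
            return (host, port), conn
        except OSError:
            # Refused or timed out: go straight to the next server.
            continue
    return None, None
```

With a single load balancer address the list has one entry, so there is no
"next server" to fall through to and the client can only retry the same IP.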
On Nov 4, 2010, at 9:17 AM, Ted Dunning wrote:
> DNS round-robin works as well.
> On Wed, Nov 3, 2010 at 3:45 PM, Benjamin Reed <br...@yahoo-inc.com> wrote:
>> it would have to be a TCP based load balancer to work with ZooKeeper
>> clients, but other than that it should work really well. The clients will be
>> doing heart beats so the TCP connections will be long lived. The client
>> library does random connection load balancing anyway.
>> On 11/03/2010 12:19 PM, Luka Stojanovic wrote:
>>> What would be expected behavior if a three node cluster is put behind a
>>> balancer? It would ease deployment because all clients would be configured
>>> to target zookeeper.example.com regardless of actual cluster
>>> but I have impression that client-server connection is stateful and that
>>> jumping randomly from server to server could bring strange behavior.
>>> Luka Stojanovic
>>> Platform Engineering