Re: cluster fails to start - broken snapshot?

Benjamin Reed Thu, 18 Mar 2010 10:58:55 -0700

we have updated ZOOKEEPER-713 with much more detail, but the bottom line is that the Invalid snapshot was caused by an OutOfMemoryError. this turns out not to be a problem since we recover using an older snapshot. there are other things happening that are the real causes of the problem. see the jira for details.


thanx
ben


On 03/18/2010 09:16 AM, Łukasz Osipiuk wrote:

Hi guys,

Today we experienced another problem with our zookeeper installation.
Because of the large attachments I created a jira issue for it, even
though it is more of a question than a bug report.

https://issues.apache.org/jira/browse/ZOOKEEPER-713

Description below:

Today we had a major failure in our production environment. The machines
in the zookeeper cluster went wild and all clients got disconnected.
We tried to restart the whole zookeeper cluster, but it got stuck in
the leader election phase.

Calling the stat command on any machine in the cluster resulted in a
'ZooKeeperServer not running' message.
In one of the logs I noticed an 'Invalid snapshot' message, which
disturbed me a bit.
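
For readers who want to repeat the stat check against each node, here is a
minimal sketch in Python. It assumes the servers accept ZooKeeper's
four-letter commands on the client port; the helper name four_letter_word,
the host names, and port 2181 are illustrative assumptions and are not
taken from the attached zoo.cfg.

```python
import socket


def four_letter_word(host, port, command=b"stat", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return its reply.

    The server writes its answer and then closes the connection, so we
    read until recv() returns an empty byte string.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Hypothetical usage (host names and port are assumptions):
# for host in ("node1", "node2", "node3"):
#     print(host, four_letter_word(host, 2181))
```

A node that has not joined a quorum would be expected to answer with the
'ZooKeeperServer not running' text quoted above rather than the usual
statistics.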

We did not manage to get the cluster working again with its data. We
deleted all version-2 directories on all nodes, and then the cluster
started up without problems.
Is it possible that the snapshot/log data got corrupted in a way that
made the cluster unable to start?
Fortunately we could rebuild the data we store in zookeeper, as we use
it only for locks and most of the nodes are ephemeral.

I am attaching the contents of the version-2 directory from all nodes
and the server logs.
The original problem occurred some time before 15:00. The first cluster
restart happened at 15:03.
At some point later we experimented with deleting the version-2
directory, so I would not look at the following restarts because they
can be misleading due to our actions.

I am also attaching zoo.cfg. Maybe something is wrong there.
As I now look into the logs I see a read timeout during the
initialization phase after 20 secs (initLimit=10, tickTime=2000).
Maybe all I have to do is increase one or the other. Which one? Are
there any downsides to increasing tickTime?
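
The 20 second figure is initLimit multiplied by tickTime. The sketch below
works through that arithmetic and shows what else scales with tickTime. It
assumes the documented ZooKeeper defaults, where session timeouts are
bounded to between 2 and 20 ticks; the syncLimit value of 5 and the
function name derived_timeouts are illustrative assumptions, not values
from the attached zoo.cfg.

```python
def derived_timeouts(tick_time_ms, init_limit, sync_limit):
    """Return the millisecond windows derived from tick-based settings."""
    return {
        # Time a follower has to connect to and sync with the leader.
        "init_timeout_ms": tick_time_ms * init_limit,
        # How far a follower may lag behind the leader afterwards.
        "sync_timeout_ms": tick_time_ms * sync_limit,
        # Default bounds on the session timeout a client can negotiate.
        "min_session_timeout_ms": 2 * tick_time_ms,
        "max_session_timeout_ms": 20 * tick_time_ms,
    }


# Values from the message; sync_limit=5 is an assumed placeholder.
current = derived_timeouts(tick_time_ms=2000, init_limit=10, sync_limit=5)
more_ticks = derived_timeouts(tick_time_ms=2000, init_limit=20, sync_limit=5)
longer_tick = derived_timeouts(tick_time_ms=4000, init_limit=10, sync_limit=5)

for name, cfg in (("current", current),
                  ("initLimit=20", more_ticks),
                  ("tickTime=4000", longer_tick)):
    print(name, cfg)
```

Under these assumptions, raising initLimit lengthens only the
initialization window, while raising tickTime also stretches the sync
window and the session timeout bounds, so dead clients and their ephemeral
lock nodes could take longer to be noticed.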

Best regards, Łukasz Osipiuk

PS. Due to the attachment size limit I used split. To untar, use
cat nodeX-version-2.tgz-* | tar -xz
