[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Patrick Hunt (JIRA) Thu, 18 Mar 2010 16:14:53 -0700

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847156#action_12847156 ]


Patrick Hunt commented on ZOOKEEPER-713:
----------------------------------------

> I guess the JVM was swapping (as it was running out of memory), which caused 
> delays with file transfer. What do you think? Is it possible?

We did see cases in your logs where the init seemed to take a very short amount 
of time, and some cases where it took a very long time. It could be this, it 
could be virtualization or intermittent network problems; all of these sound 
possible.

> One more question - is it normal that zookeeper consumed 0.5G of memory 
> handling such a small snapshot?

If you look at that troubleshooting page you'll see a link to the latency 
overview (http://bit.ly/4ekN8G). Notice that this test used a heap size of 512m 
with the 1.6.0_05 JVM. I was creating a large number of znodes and watches and 
it worked fine: in the 20-client case I'm creating 200k znodes (20 MB) and 
1 million watches.

I don't know what might be taking the memory in your case (not knowing the use 
cases, znode count, average znode size, etc.) but you can get more insight 
using something like VisualVM or JConsole. Try attaching to the running VM and 
take a look.
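
As a rough illustration of the kind of numbers those tools show, here is a 
minimal Java sketch that reads heap usage through the standard platform 
MemoryMXBean. The class name HeapProbe and the host:port argument are 
hypothetical, and the remote case assumes the ZooKeeper JVM was started with 
remote JMX enabled, which may not be true of your setup.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of ZooKeeper.
public class HeapProbe {

    /** Reads heap usage from whichever JVM the given MBean connection points at. */
    public static MemoryUsage heapUsage(MBeanServerConnection conn) throws IOException {
        MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        return memory.getHeapMemoryUsage();
    }

    /** Formats the usage in whole megabytes; getMax() is -1 when no maximum is defined. */
    public static String describe(MemoryUsage heap) {
        long mb = 1024L * 1024L;
        String max = heap.getMax() < 0 ? "undefined" : (heap.getMax() / mb) + " MB";
        return "used=" + (heap.getUsed() / mb) + " MB, committed="
                + (heap.getCommitted() / mb) + " MB, max=" + max;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No argument: inspect this JVM itself, just to show the output format.
            System.out.println(describe(heapUsage(ManagementFactory.getPlatformMBeanServer())));
            return;
        }
        // Assumption: args[0] is host:port of a JVM with remote JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            System.out.println(describe(heapUsage(connector.getMBeanServerConnection())));
        }
    }
}
```

Sampling this a few times while the server is under load should show whether 
the heap is really being used or merely committed.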

> I will consider building a newer java deb package myself but I would rather 
> treat it as a last resort. Do you really think a newer java version could help?

I wouldn't try that as a first option. It's something we noticed and wanted to 
mention in case it was easy to try.

Do take a look at the troubleshooting page: adding some monitoring of the JVM 
might help to provide insight, as would monitoring the parameters that 
ZooKeeper itself makes available through the command port and JMX. Some of the 
other issues listed there have been faced by more than one user, so they might 
be helpful too.
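
For the command port, a minimal Java sketch of sending a four-letter command 
such as "ruok" or "stat" to a server might look like the following. The class 
name FourLetterWord, the default host localhost, and port 2181 are assumptions 
for illustration; use the clientPort from your own zoo.cfg.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ZooKeeper.
public class FourLetterWord {

    /** Sends one command to the server's client port and returns the full reply. */
    public static String send(String host, int port, String command, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            OutputStream out = socket.getOutputStream();
            out.write(command.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // The server writes its reply and then closes the connection,
            // so read until end of stream.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass host and port to override.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        System.out.println(send(host, port, "ruok", 5000));
        System.out.println(send(host, port, "stat", 5000));
    }
}
```

Running this periodically against each server and logging the "stat" output 
is one simple way to collect the latency and connection figures over time.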



> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
>                 Key: ZOOKEEPER-713
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: node1-version-2.tgz-aa, node1-version-2.tgz-ab, 
> node1-zookeeper.log.gz, node2-version-2.tgz-aa, node2-version-2.tgz-ab, 
> node2-version-2.tgz-ac, node2-zookeeper.log.gz, node3-version-2.tgz-aa, 
> node3-version-2.tgz-ab, node3-version-2.tgz-ac, node3-zookeeper.log.gz, 
> zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question - but as I am 
> attaching large files I am posting it here rather than on the mailing list.
> Today we had a major failure in our production environment. Machines in the 
> zookeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole zookeeper cluster but the cluster got stuck in 
> the leader election phase.
> Calling the stat command on any machine in the cluster resulted in a 
> 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message which disturbed me 
> a bit.
> We did not manage to make the cluster work again with the data. We deleted all 
> version-2 directories on all nodes and then the cluster started up without 
> problems.
> Is it possible that the snapshot/log data got corrupted in a way which made 
> the cluster unable to start?
> Fortunately we could rebuild the data we store in zookeeper as we use it only 
> for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes and the 
> server logs.
> The source problem occurred some time before 15. The first cluster restart 
> happened at 15:03.
> At some point later we experimented with deleting the version-2 directory so I 
> would not look at the following restarts because they can be misleading due to 
> our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there. 
> As I now look into the logs I see a read timeout during the initialization 
> phase after 20 secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there any 
> downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
> PS. Due to the attachment size limit I used split. To untar, use 
> cat nodeX-version-2.tgz-* |tar -xz
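
For reference, the 20-second read timeout mentioned in the description matches 
the configured values: initLimit is counted in ticks and tickTime is in 
milliseconds, so the initialization timeout is their product. A small Java 
sketch of that arithmetic (the class and method names are hypothetical):

```java
// Hypothetical helper showing how the init timeout follows from zoo.cfg values.
public class InitTimeout {

    /** initLimit is a number of ticks; tickTimeMs is the length of one tick in ms. */
    public static long initTimeoutMs(int initLimit, int tickTimeMs) {
        return (long) initLimit * tickTimeMs;
    }

    public static void main(String[] args) {
        // Values from the reported zoo.cfg: initLimit=10, tickTime=2000.
        System.out.println(initTimeoutMs(10, 2000) + " ms"); // prints "20000 ms"
    }
}
```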

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
