This past week we had a ZooKeeper outage: all clients lost contact with the quorum. I am still trying to understand what happened. One of our servers ran out of disk space because, between 02:23 and 02:50, ZooKeeper created almost 16 GB of data.
ls -al /opt/eg/zookeeper/data/version-2
total 1662608
drwxr-xr-x 2 egadmin eggrp     61440 Jun 16 09:01 .
drwxr-xr-x 3 egadmin eggrp      4096 Jun  5 07:21 ..
-rw------- 1 egadmin eggrp         3 Jun 16 09:01 acceptedEpoch
-rw------- 1 egadmin eggrp         3 Jun 16 09:01 currentEpoch
-rw------- 1 egadmin eggrp      5986 Jun 16 09:01 snapshot.14d00018763
-rw------- 1 egadmin eggrp 522637450 Jun 16 02:23 snapshot.1a0060ab78
-rw------- 1 egadmin eggrp 523110346 Jun 16 02:24 snapshot.1a0060c200
-rw------- 1 egadmin eggrp 528639820 Jun 16 02:36 snapshot.1a0061c975
-rw------- 1 egadmin eggrp 128020480 Jun 16 02:50 snapshot.1a0062fd8c
[root@jtcmpslegwap01 ~]#

The ZooKeeper tree does not have much data in it. There are about 8 leaders, 1 pathcache with single strings, and one data element at a single zpath.

What would cause ZooKeeper to create so many large snapshots? What goes into the snapshots besides the data?

Thank you,
Curtis Cantrell
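A note on reading the listing above: as far as I understand, the suffix of each snapshot file name is the zxid of the last transaction covered by that snapshot, written in hex, with the epoch in the high 32 bits and a per-epoch counter in the low 32 bits. The following is only a minimal Python sketch under that assumption (the helper names `split_zxid` and `counter_gaps` are made up for illustration). It estimates how many transactions were committed between the four large snapshots, which may help tell whether the write rate or the snapshot size is the unusual part.

```python
# Snapshot file names copied from the directory listing above.
SNAPSHOTS = [
    "snapshot.1a0060ab78",
    "snapshot.1a0060c200",
    "snapshot.1a0061c975",
    "snapshot.1a0062fd8c",
]


def split_zxid(name):
    """Return (epoch, counter) for a snapshot file name.

    Assumption: the suffix after the dot is a hex zxid whose high
    32 bits are the epoch and whose low 32 bits are the counter.
    """
    zxid = int(name.split(".", 1)[1], 16)
    return zxid >> 32, zxid & 0xFFFFFFFF


def counter_gaps(names):
    """Differences between the counters of consecutive snapshots."""
    counters = [split_zxid(n)[1] for n in names]
    return [later - earlier for earlier, later in zip(counters, counters[1:])]


if __name__ == "__main__":
    for name in SNAPSHOTS:
        epoch, counter = split_zxid(name)
        print(f"{name}: epoch={epoch} counter={counter}")
    print("transactions between snapshots:", counter_gaps(SNAPSHOTS))
```

The gaps are only meaningful between snapshots that share the same epoch, since the counter restarts when a new leader is elected.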
