Thanks, Karol. We ended up doing something similar to what you suggested: we restarted ZK on a different port with 2x the heap size and quite large initLimit and syncLimit values, then deleted the unnecessary znodes. Snapshot size is back to 250 MB now.
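For anyone who needs to do the same cleanup, here is a minimal sketch of the deletion step. It is not the script we ran. It only illustrates one constraint: ZooKeeper refuses to delete a znode that still has children, so paths have to be removed deepest-first. The function names `deletion_order` and `purge`, and the injected `delete` callable, are hypothetical.

```python
from typing import Callable, Iterable, List


def deletion_order(paths: Iterable[str]) -> List[str]:
    """Order znode paths so that children come before their parents."""
    # Strip trailing slashes and drop duplicates; the root "/" becomes ""
    # and is skipped, since the root znode cannot be deleted anyway.
    unique = {p.rstrip("/") for p in paths}
    unique.discard("")
    # A deeper path contains more "/" separators, so sort by depth descending.
    # The path itself is the tie-breaker to keep the order deterministic.
    return sorted(unique, key=lambda p: (-p.count("/"), p))


def purge(paths: Iterable[str], delete: Callable[[str], None]) -> int:
    """Call delete(path) for every path, children first; return the count."""
    removed = 0
    for path in deletion_order(paths):
        delete(path)  # hypothetical hook: whatever your ZK client provides
        removed += 1
    return removed
```

The `delete` argument can be any callable that removes a single znode, for instance the `delete` method of a connected client object. Which client library to use is left open here; collecting the list of rogue paths is also outside this sketch.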
Any recommendations for ZK data browser tools?

CP

On Sat, Apr 25, 2015 at 6:12 AM, Karol Dudzinski <[email protected]> wrote:

> Hi CP,
>
> The JIRA is
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2141
> .
>
> Doesn't sound like the same thing as what you're facing. However, we also
> had OOM errors which was what caused us to start digging through the
> snapshots in detail. As far as I can tell, in your case the only option is
> to bump up max heap size sufficiently to allow the server to come up and
> then delete the rogue entries. One of the ZK devs may have some other
> ideas.
>
> Karol
>
>
> On 24 Apr 2015, at 22:28, CP Mishra <[email protected]> wrote:
> >
> > Karol, that's interesting. Can you send the Jira ticket, please?
> >
> > In our case, a rogue program added 300k entries via a service that
> > persists data in ZK and is meant for only a handful of entries. Now, we
> > are dealing with deleting these entries taking up > 3 GB.
> >
> > Thanks,
> > CP
> >
> > On Fri, Apr 24, 2015 at 1:09 PM, Karol Dudzinski <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> Do you know if any of the services that use your ZK create ACLs that are
> >> potentially unique and one-time-ish? I recently hit a similar problem
> >> and discovered that the DataTree has an ACL cache that never gets
> >> anything removed from it. That was by far and away the largest memory
> >> consumer I found when analysing the heap dump. If this is the case then
> >> you should see lots of ACL objects on the heap.
> >>
> >> I filed a JIRA for this and keep meaning to submit a patch but sadly
> >> haven't got round to it. As an interim solution, I wrote a tool which
> >> uses the DataTree class and the serialisation utils to purge this cache
> >> of unused entries. In my case it shrank the snapshot from 500MB to 12MB!
> >> The time to write the snapshot went from 40 seconds to less than 1
> >> second as a result.
> >>
> >> Thanks,
> >> Karol
> >>
> >>
> >>> On 24 Apr 2015, at 18:45, CP Mishra <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I am running a 3 node ZK ensemble on 3 VMs (2 CPU, 32GB RAM) in the
> >>> test environment. Lately, I have been getting OutOfMemoryError on all
> >>> three ZK nodes. ZK has been configured with 6GB heap size. The same ZK
> >>> ensemble is shared between Kafka, HDFS HA and another custom service.
> >>>
> >>> I analyzed the heap dump and 5.8+ GB is being used by DataTree. I don't
> >>> have a purge policy in place and size of ZK data directory stands at
> >>> ~14 GB now. There is enough space on the disk holding ZK data (20%
> >>> used).
> >>>
> >>> As soon as I restart a ZK node, it grows to use all 6GB and starts Full
> >>> GC every 1-2 sec. In 3-5 minutes, it throws OOM: GC Overhead exceeded.
> >>>
> >>> I would appreciate any help in diagnosing the issue.
> >>>
> >>> Thanks,
> >>> CP Mishra
