I ran into a very similar scenario. Disk filling up --> zookeeper crash --> topologies disappear. The lesson learnt was to isolate zookeeper from other processes, which should have been done from the start. Here is the code that cleans up topologies - https://github.com/apache/storm/blob/v0.10.0/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L810
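For context, the gist of that cleanup (as I read it) is a set difference: topology ids that are active in cluster state but have no corresponding code on Nimbus's local disk are treated as corrupt and removed. Below is a rough sketch of that idea in Java; the class, interface, and method names are hypothetical stand-ins for illustration, not Storm's actual API (the real code is the Clojure linked above).

import java.util.HashSet;
import java.util.Set;

/*
 * Hypothetical sketch of the reconciliation idea behind Nimbus's cleanup
 * pass: topology ids active in cluster state but missing their code on
 * local disk are removed. Names here are illustrative, not Storm's API.
 */
public class CorruptTopologyCleanupSketch {

    /* Minimal stand-in for the cluster-state handle (hypothetical). */
    interface ClusterState {
        Set<String> activeTopologyIds();        // ids the cluster state believes are running
        void removeTopology(String topologyId); // drop the topology's state entries
    }

    /* Minimal stand-in for Nimbus's local code storage (hypothetical). */
    interface CodeStore {
        Set<String> storedTopologyIds();        // ids whose jars/configs exist on disk
    }

    static void cleanupCorruptTopologies(ClusterState state, CodeStore code) {
        // Active topologies whose code is missing locally are considered corrupt.
        Set<String> corrupt = new HashSet<>(state.activeTopologyIds());
        corrupt.removeAll(code.storedTopologyIds());
        for (String id : corrupt) {
            // If the local dir got wiped, every active topology lands in this set.
            state.removeTopology(id);
        }
    }
}

If that reading is right, it would explain the symptom below: once the local dir (e.g. something under /tmp, or wherever storm.local.dir points) is lost, every active topology looks corrupt when that cleanup runs and gets removed.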
On Thu, Apr 14, 2016 at 10:42 PM, John Bush <[email protected]> wrote:
> So we had a zookeeper outage the other day that somehow ended up causing
> Storm to delete all its topologies. I'm looking to see if this is
> something anyone else has experienced, and whether or not a Storm upgrade
> might address some of my concerns.
>
> Here is what I've figured out so far:
>
> Storm 0.10 - two worker nodes, one runs nimbus
> Kafka 0.8.2.1 - 3 nodes
> Zookeeper 3.4.5 - 3 nodes
>
> Zookeeper and Kafka clusters crashed, Storm jobs went into a whirlwind of
> failures, leaving turds in /tmp and filling up the disk.
> Woke up in the morning: all topology jars missing, nowhere to be found.
> Looked at Storm data in zookeeper; looks like everything is missing there.
> Tried to republish a job; nimbus picks it up, starts it, then decides the
> job shouldn't be there and kills it.
> Cleaned out zookeeper data - no change
> Cleaned out localstate data - no change
> Shut down storm node2, cleaned out localstate on node1 and zookeeper data
> Restarted storm node1
> Success!
>
> So I think the localstate also got corrupted. I'm not sure what exactly
> got corrupted first, but it appears Storm started trusting the wrong
> source of truth and decided all the jobs shouldn't be there.
>
> So anyone else ever run into this? Thoughts?

--
Regards,
Abhishek Agarwal
