That error makes me think ZooKeeper isn't healthy. It isn't clear from your message whether you've verified that ZK is OK, but I'd at least run the health check on each ZK host ($ echo ruok | nc zookeeperhost 2181). We've run into this issue before when a ZK host ran out of disk space due to a poorly configured cleanup script <https://zookeeper.apache.org/doc/r3.4.9/zookeeperAdmin.html#sc_maintenance>. If that doesn't look like the cause, I would still browse the ZooKeeper logs to make sure everything looks good on that end before looking for possible issues in Storm.
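
If you have more than a couple of ZK hosts, here is a rough Python sketch of the same ruok check that you could loop over the ensemble. The function names (zk_ruok, check_ensemble) and the host names in the usage note are made up for illustration, and it assumes the four-letter commands are reachable on the client port, as they are with the nc one-liner above. Keep in mind that an "imok" reply only says the server process is up in a non-error state; it says nothing about quorum or disk space, so the logs are still worth a look.

```python
import socket


def zk_ruok(host, port=2181, timeout=3.0):
    """Send ZooKeeper's 'ruok' four-letter command to one server.

    Returns True only if the server answers 'imok'. Any connection
    error, timeout, or other reply is treated as unhealthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server replies and then closes the connection,
            # so read until recv() returns an empty bytes object.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return b"".join(chunks) == b"imok"


def check_ensemble(hosts, port=2181, timeout=3.0):
    """Run the ruok check against each host; return {host: bool}."""
    return {host: zk_ruok(host, port, timeout) for host in hosts}
```

For example, check_ensemble(["zk1.example.com", "zk2.example.com", "zk3.example.com"]) (hypothetical host names) returns a dict you can scan for any False entries.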

On Wed, Nov 23, 2016 at 4:07 AM uday bhaskar wrote:

> Hi,
>
> We have been running STORM for a few months in production.
>
> We started facing an issue with workers crashing all the time.
>
> 2016-11-23 11:56:51.818 o.a.s.util [ERROR] Halting process: ("Worker died")
> java.lang.RuntimeException: ("Worker died")
>     at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341) [storm-core-1.0.0.jar:1.0.0]
>     at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.7.0.jar:?]
>     at org.apache.storm.daemon.worker$fn__8831$fn__8832.invoke(worker.clj:762) [storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.daemon.executor$mk_executor_data$fn__8046$fn__8047.invoke(executor.clj:271) [storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:494) [storm-core-1.0.0.jar:1.0.0]
>     at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
>
> Our suspicion is that its being caused because of the following error,
>
> java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /partition_2/145665
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:448) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:414) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.daemon.executor$fn__8226$fn__8239$fn__8292.invoke(executor.clj:851) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:484) [storm-core-1.0.0.jar:1.0.0]
>     at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
> Caused by: java.lang.RuntimeException: org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /partition_2/145665
>     at org.apache.storm.trident.topology.state.TransactionalState.setData(TransactionalState.java:119) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.trident.topology.state.RotatingTransactionalState.overrideState(RotatingTransactionalState.java:52) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.trident.spout.OpaquePartitionedTridentSpoutExecutor$Emitter.commit(OpaquePartitionedTridentSpoutExecutor.java:167) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.trident.spout.TridentSpoutExecutor.execute(TridentSpoutExecutor.java:70) ~[storm-core-1.0.0.jar:1.0.0]
>     at org.apache.storm.trident.topology.TridentBoltExecutor.execute(TridentBoltExecutor.java:328) ~[storm-core-1.0.0.jar:1.0.0]
>
> This error is happening all the time, we have workers crashing every few minutes in our prod cluster currently.
>
> We found the following JIRA for this issue,
>
> https://issues.apache.org/jira/browse/STORM-1114
>
> which looks similar, but we don't have the problem in our beta or alpha cluster.
>
> Any help would be highly appreciated.
>
> Uday
