We are running Storm 0.9.0.1 with Netty and Trident topologies on a single
machine (nimbus, supervisor, and drpc all run on the same host). The supervisor
keeps dying and is restarted 7-8 seconds later by Supervisord (the service
that restarts the Storm and ZooKeeper processes). Here is the error we see
over and over in supervisor.log:
2014-04-15 21:13:13 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
    at backtype.storm.utils.Utils.deserialize(Utils.java:69) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.LocalState.snapshot(LocalState.java:28) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.LocalState.get(LocalState.java:39) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:187) ~[storm-core-0.9.0.1.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
    at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
    at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
    at backtype.storm.event$event_manager$fn__3072.invoke(event.clj:24) ~[storm-core-0.9.0.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Unknown Source) [na:1.7.0_45]
Caused by: java.io.EOFException: null
    at java.io.ObjectInputStream$PeekInputStream.readFully(Unknown Source) ~[na:1.7.0_45]
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source) ~[na:1.7.0_45]
    at java.io.ObjectInputStream.readStreamHeader(Unknown Source) ~[na:1.7.0_45]
    at java.io.ObjectInputStream.<init>(Unknown Source) ~[na:1.7.0_45]
    at backtype.storm.utils.Utils.deserialize(Utils.java:64) ~[storm-core-0.9.0.1.jar:na]
    ... 11 common frames omitted
2014-04-15 21:13:13 b.s.util [INFO] Halting process: ("Error when processing an event")
Any ideas why supervisor might be dying?
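
For what it's worth, the innermost frames are in ObjectInputStream.readStreamHeader, which is where Java serialization fails when the input is empty or shorter than the stream header. That makes us suspect an empty or truncated local state file, although we have not confirmed this. Below is a minimal sketch that reproduces the same exception type outside of Storm; the class and method names (EofRepro, failsWithEof, serialize) are made up for illustration and are not part of Storm.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical helper, not part of Storm: shows that Java deserialization
// of empty or too-short input fails with EOFException in the constructor.
public class EofRepro {

    // Returns true if deserializing these bytes fails with EOFException.
    static boolean failsWithEof(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException | ClassNotFoundException e) {
            return false; // some other failure, not the one in our log
        }
    }

    // Serializes an object to a byte array with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(failsWithEof(new byte[0]));              // true
        System.out.println(failsWithEof(new byte[] {(byte) 0xAC})); // true
        System.out.println(failsWithEof(serialize("ok")));          // false
    }
}
```

If this guess is right, anything that leaves a zero-length state file behind (for example a process killed in the middle of a write, or a full disk) could trigger the error on the next read.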
Following the recommendation in the post "Supervisor throwing error on start up"
(https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8), we stopped the
Storm processes, cleared the Storm and ZooKeeper data directories, and everything
was fine after we loaded the topologies again. However, we would like to know
how to prevent this bug from happening in a production environment.
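
As a less drastic check than clearing everything, one could look for zero-length files in the supervisor's local state directory before restarting it. The sketch below assumes that an empty file is what causes the EOFException (unconfirmed) and that the directory to inspect is passed on the command line; we are not certain of the exact location under the Storm local directory, so no path is hard-coded. EmptyStateFiles and findEmpty are hypothetical names, not Storm APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic, not part of Storm: lists zero-length regular
// files directly inside a directory (non-recursive).
public class EmptyStateFiles {

    static List<Path> findEmpty(Path dir) throws IOException {
        List<Path> empty = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p) && Files.size(p) == 0) {
                    empty.add(p);
                }
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EmptyStateFiles <state-directory>");
            System.exit(2);
        }
        // The directory is supplied by the caller; its location is an assumption.
        for (Path p : findEmpty(Paths.get(args[0]))) {
            System.out.println("empty file: " + p);
        }
    }
}
```

This only reports suspicious files; whether it is safe to delete them while keeping the rest of the state is something we do not know.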
We are also getting a ton of "Connection refused" errors in the Nimbus and worker
logs. I assume that is to be expected if the supervisor can't start up.
Thank you,
Randy