We run the nimbus, supervisor, and UI daemons on the same machine as a number of
our topologies. We have a start script that runs the following:
PROCESS_FILTER=`storm list | egrep -io "topology-prefix"${TOPOLOGY-ID}`
if [[ ! -z "${PROCESS_FILTER}" ]]; then
    echo "Shutting down $TOPOLOGY_NAME in cluster mode"
    # Proper way of killing, in cluster mode
    $STORM_CMD kill $TOPOLOGY_NAME -w 5
    rc=$?
    if [[ $rc -ne 0 ]]; then
        exit ${rc}
    fi
else
    echo "$TOPOLOGY_NAME is not running in either local-mode or cluster-mode"
fi
......
Running this gave us the following stack trace in the nimbus logs:
2019-04-07 13:57:12,230 ERROR ProcessFunction [pool-14-thread-38] Internal error processing getClusterInfo
org.apache.storm.generated.NotAliveException: null
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_172]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_172]
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_172]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_172]
    at clojure.lang.Reflector.invokeConstructor(Reflector.java:180) ~[clojure-1.7.0.jar:?]
    at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:562) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.daemon.nimbus$get_resources_for_topology.invoke(nimbus.clj:918) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10704__10708$fn__10709.invoke(nimbus.clj:1583) ~[storm-core-1.2.1.jar:1.2.1]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
    at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.7.0.jar:?]
    at clojure.lang.RT.next(RT.java:674) ~[clojure-1.7.0.jar:?]
    at clojure.core$next__4112.invoke(core.clj:64) ~[clojure-1.7.0.jar:?]
    at clojure.core$dorun.invoke(core.clj:3010) ~[clojure-1.7.0.jar:?]
    at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
    at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1564) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10799.getClusterInfo(nimbus.clj:2019) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.security.auth.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:144) ~[storm-core-1.2.1.jar:1.2.1]
    at org.apache.storm.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[storm-core-1.2.1.jar:1.2.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Followed by this line from the script:
13:57:12 tsdbdwnp289: WingmanTopology289 is not running in either local-mode or cluster-mode
And in the supervisor log:
2019-04-07 13:57:13,559 INFO BasicContainer [Thread-35] Worker Process d9622e0d-edb6-41e4-9d74-8c1a42f23ad1 exited with code: 20
2019-04-07 13:57:13,603 INFO BasicContainer [Thread-37] Worker Process 744ab585-b8d8-4bcd-909b-e55a46887e67 exited with code: 20
2019-04-07 13:57:13,668 INFO BasicContainer [Thread-38] Worker Process 26e2d425-f203-4758-82c1-2d8beaee1b00 exited with code: 20
2019-04-07 13:57:13,696 INFO BasicContainer [Thread-34] Worker Process 70af57ee-c871-459b-88e1-b1bc2553d832 exited with code: 20
2019-04-07 13:57:13,698 INFO BasicContainer [Thread-40] Worker Process 5b7bbb1d-1b84-4ee3-bcd5-35175db1b710 exited with code: 20
2019-04-07 13:57:14,218 INFO BasicContainer [Thread-39] Worker Process f1abdf84-303f-4579-9e5c-ccc16c3f418e exited with code: 20
2019-04-07 13:57:14,244 INFO BasicContainer [Thread-41] Worker Process 37719483-8f36-4c75-8a81-4e25dc53a23d exited with code: 20
However, none of the worker processes that we had an issue with ever exited.
Going by the script above, I believe the cause was an exception thrown while
calling 'storm list': the failed call left PROCESS_FILTER empty, so the script
took the "not running" branch and never issued the kill. Any idea how this
could have happened? Why would 'storm list' cause a NotAliveException in
Nimbus? It seems to be a transient issue, as we were able to shut down the
topology successfully later in the day. This all occurred during a machine
turn, so a lot of topologies were coming down in succession.
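
For what it's worth, here is a minimal sketch of how the check could tell a
failed 'storm list' apart from a topology that really is not listed, rather
than treating both as "not running". STORM_CMD and TOPOLOGY_NAME are from our
script; the function name topology_state, the retry count, and the delay are
hypothetical. It also assumes that 'storm list' exits non-zero when the Nimbus
call fails and that the topology name is the first column of each row, both of
which we have not verified.

```bash
#!/usr/bin/env bash
# Sketch only: separate "storm list failed" from "topology not listed".
# Assumptions (unverified): `storm list` exits non-zero when the Nimbus call
# fails, and each topology row starts with the topology name.

STORM_CMD="${STORM_CMD:-storm}"

# topology_state NAME [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Returns 0 if NAME is listed, 1 if the listing succeeded without NAME,
# and 2 if the listing command failed on every attempt.
topology_state() {
    local name="$1" attempts="${2:-3}" delay="${3:-5}"
    local output i
    for ((i = 1; i <= attempts; i++)); do
        # Capture the exit status of the listing itself, not of a pipeline.
        if output="$($STORM_CMD list 2>&1)"; then
            # Exact match on the first column, so prefixes do not match.
            if awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' <<<"$output"; then
                return 0
            fi
            return 1
        fi
        echo "storm list failed (attempt $i of $attempts)" >&2
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
    return 2
}
```

The script would then kill the topology on status 0, print the "not running"
message on status 1, and exit with an error on status 2 instead of silently
skipping the kill.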