NotAliveException when running storm list

Mitchell Rathbun (BLOOMBERG/ 731 LEX) Mon, 08 Apr 2019 16:49:08 -0700

We run the Nimbus, Supervisor, and UI daemons on the same machine as a number 
of our topologies. We have a start script that runs the following:


    PROCESS_FILTER=$(storm list | egrep -io "topology-prefix"${TOPOLOGY-ID})
    if [[ -n "${PROCESS_FILTER}" ]]; then
        echo "Shutting down $TOPOLOGY_NAME in cluster mode"
        # Proper way of killing, in cluster mode
        $STORM_CMD kill "$TOPOLOGY_NAME" -w 5
        rc=$?
        if [[ $rc -ne 0 ]]; then
            exit ${rc}
        fi
    else
        echo "$TOPOLOGY_NAME is not running in either local-mode or cluster-mode"
    fi

......

Running this gave us the following stack trace in the nimbus logs:

2019-04-07 13:57:12,230 ERROR ProcessFunction [pool-14-thread-38] Internal error processing getClusterInfo
org.apache.storm.generated.NotAliveException: null
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_172]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_172]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_172]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_172]
        at clojure.lang.Reflector.invokeConstructor(Reflector.java:180) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:562) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$get_resources_for_topology.invoke(nimbus.clj:918) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10704__10708$fn__10709.invoke(nimbus.clj:1583) ~[storm-core-1.2.1.jar:1.2.1]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
        at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.7.0.jar:?]
        at clojure.lang.RT.next(RT.java:674) ~[clojure-1.7.0.jar:?]
        at clojure.core$next__4112.invoke(core.clj:64) ~[clojure-1.7.0.jar:?]
        at clojure.core$dorun.invoke(core.clj:3010) ~[clojure-1.7.0.jar:?]
        at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1564) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10799.getClusterInfo(nimbus.clj:2019) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.security.auth.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:144) ~[storm-core-1.2.1.jar:1.2.1]
        at org.apache.storm.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) ~[storm-core-1.2.1.jar:1.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]

Followed by:

13:57:12 tsdbdwnp289: WingmanTopology289 is not running in either local-mode or cluster-mode

And in Supervisor:

2019-04-07 13:57:13,559 INFO  BasicContainer [Thread-35] Worker Process d9622e0d-edb6-41e4-9d74-8c1a42f23ad1 exited with code: 20
2019-04-07 13:57:13,603 INFO  BasicContainer [Thread-37] Worker Process 744ab585-b8d8-4bcd-909b-e55a46887e67 exited with code: 20
2019-04-07 13:57:13,668 INFO  BasicContainer [Thread-38] Worker Process 26e2d425-f203-4758-82c1-2d8beaee1b00 exited with code: 20
2019-04-07 13:57:13,696 INFO  BasicContainer [Thread-34] Worker Process 70af57ee-c871-459b-88e1-b1bc2553d832 exited with code: 20
2019-04-07 13:57:13,698 INFO  BasicContainer [Thread-40] Worker Process 5b7bbb1d-1b84-4ee3-bcd5-35175db1b710 exited with code: 20
2019-04-07 13:57:14,218 INFO  BasicContainer [Thread-39] Worker Process f1abdf84-303f-4579-9e5c-ccc16c3f418e exited with code: 20
2019-04-07 13:57:14,244 INFO  BasicContainer [Thread-41] Worker Process 37719483-8f36-4c75-8a81-4e25dc53a23d exited with code: 20

However, none of the worker processes for the topology we had an issue with 
ever exited. Based on the script above, I believe the issue was caused by an 
exception from calling 'storm list': the command produced no matching output, 
so the script took the else branch and never ran the kill. Any idea how this 
could have happened? Why would 'storm list' cause a NotAliveException in 
Nimbus? It seems to be a transient issue, as we were able to successfully shut 
down the topology later in the day. This all occurred during a machine turn, 
so a lot of topologies were coming down in succession.
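
To illustrate the failure mode described above, here is a minimal sketch of a check that keeps a failed 'storm list' apart from "topology not listed". The function name topology_is_listed and its return codes are hypothetical, not part of Storm. It also assumes that 'storm list' exits non-zero when Nimbus returns an error, which I have not verified on 1.2.1.

```shell
# Sketch only. STORM_CMD defaults to "storm" if the caller has not set it.
STORM_CMD="${STORM_CMD:-storm}"

# topology_is_listed NAME (hypothetical helper)
#   returns 0 if NAME appears in the `storm list` output
#   returns 1 if the listing succeeded but NAME is absent
#   returns 2 if `storm list` itself failed (assumed: non-zero exit status)
topology_is_listed() {
    local name="$1" listing
    # Capture the output and test the exit status of `storm list` directly,
    # instead of piping straight into grep, which hides a failed listing.
    if ! listing="$("$STORM_CMD" list 2>&1)"; then
        echo "storm list failed:" >&2
        echo "$listing" >&2
        return 2
    fi
    # Case-insensitive fixed-string match, as in the original egrep -i.
    if printf '%s\n' "$listing" | grep -qiF -- "$name"; then
        return 0
    fi
    return 1
}

# Possible use in the start script (commented out so the sketch has no
# side effects when sourced):
#   topology_is_listed "$TOPOLOGY_NAME"
#   case $? in
#       0) "$STORM_CMD" kill "$TOPOLOGY_NAME" -w 5 || exit $? ;;
#       1) echo "$TOPOLOGY_NAME is not running" ;;
#       *) echo "could not list topologies, aborting" >&2; exit 2 ;;
#   esac
```

With a check like this, a transient error from Nimbus would stop the script instead of being reported as "not running". A short retry around the listing might also be reasonable, given that the failure went away on its own.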
