I deleted my Storm local dir and the ZooKeeper data and log dirs, then restarted ZooKeeper and Storm, and it works fine now.
This is still very weird and I am still investigating the root cause.

Thanks,
Rushabh

From: P. Taylor Goetz [mailto:[email protected]]
Sent: Tuesday, August 05, 2014 10:57 AM
To: [email protected]
Subject: Re: Storm workers not starting because of netty reconnect : [INFO] Reconnect started for Netty-Client

I would double check to make sure hostname resolution is working properly on all hosts in the cluster, and that there are not any firewall rules that would prevent connections on the supervisor ports.

I would also remove any Netty configuration overrides from storm.yaml to allow the defaults to take effect - only override the defaults when you need to.

- Taylor

On Aug 4, 2014, at 2:37 PM, Rushabh Shah <[email protected]> wrote:

Hi,

I have a topology that was deployed on a Storm cluster and was running fine until I started facing the following issue. In the supervisor logs I can see that the supervisor is trying to launch the topology on a worker, but it is not able to start it.

2014-08-04 18:27:33 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "SALABPOSITION-5-1-2-1406938773", :executors ([3 3] [5 5] [7 7] [9 9] [11 11] [1 1])} for this supervisor 48753f4c-e0fd-48f3-a149-1f52491da5b9 on port 6702 with id f620ab27-61fd-4b87-b017-dea1e811074b
2014-08-04 18:27:33 b.s.d.supervisor [INFO] Launching worker with command: '/integral/opt/jdk16/bin/java' '-server' '-Xmx768m' '-Djava.net.preferIPv4Stack=true' '-Djava.net.preferIPv4Stack=true' '-Xmanagement:ssl=false,authenticate=false,port=7099' '-Xmx8192m' '-Djava.library.path=/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/resources/Linux-amd64:/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dlogfile.name=worker-6702.log' '-Dstorm.home=/integral/opt/apache-storm-0.9.2-incubating' '-Dlogback.configurationFile=/integral/opt/apache-storm-0.9.2-incubating/logback/cluster.xml' '-Dstorm.id=SALABPOSITION-5-1-2-1406938773' '-Dworker.id=f620ab27-61fd-4b87-b017-dea1e811074b' '-Dworker.port=6702' '-cp' '/integral/opt/apache-storm-0.9.2-incubating/lib/ring-devel-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/servlet-api-2.5-20081211.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/compojure-1.1.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.cli-0.2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/joda-time-2.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/carbonite-1.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.macro-0.1.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clj-time-0.4.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-codec-1.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-fileupload-1.2.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/httpclient-4.3.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/asm-4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/logback-classic-1.0.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jetty-6.1.26.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-jetty-adapter-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/netty-3.2.2.Final.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/slf4j-api-1.6.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/guava-13.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/objenesis-1.2.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/kryo-2.21.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/httpcore-4.3.2.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/zookeeper-3.4.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/logback-core-1.0.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jgrapht-core-0.9.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/curator-client-2.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-lang-2.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/snakeyaml-1.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clj-stacktrace-0.2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/minlog-1.2.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-logging-1.1.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/disruptor-2.10.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/log4j-over-slf4j-1.6.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/curator-framework-2.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jline-2.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-exec-1.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/core.incubator-0.1.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/json-simple-1.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/hiccup-0.3.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clojure-1.5.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/reflectasm-1.07-shaded.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/chill-java-0.3.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-io-2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clout-1.0.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/servlet-api-2.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.logging-0.2.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-core-1.1.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/netty-3.6.3.Final.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/math.numeric-tower-0.0.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jetty-util-6.1.26.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-servlet-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/storm-core-0.9.2-incubating.jar:/integral/opt/apache-storm-0.9.2-incubating/conf:/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/stormjar.jar' 'backtype.storm.daemon.worker' 'SALABPOSITION-5-1-2-1406938773' '48753f4c-e0fd-48f3-a149-1f52491da5b9' '6702' 'f620ab27-61fd-4b87-b017-dea1e811074b'
2014-08-04 18:27:33 b.s.d.supervisor [INFO] f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
2014-08-04 18:27:33 b.s.d.supervisor [INFO] f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
2014-08-04 18:27:34 b.s.d.supervisor [INFO] f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
.....

After 120 seconds the supervisor times out and tries to start the topology on another worker.

2014-08-04 18:29:32 b.s.d.supervisor [INFO] f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
2014-08-04 18:29:32 b.s.d.supervisor [INFO] f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
2014-08-04 18:29:33 b.s.d.supervisor [INFO] Worker f620ab27-61fd-4b87-b017-dea1e811074b failed to start
2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shutting down and clearing state for id f620ab27-61fd-4b87-b017-dea1e811074b. Current supervisor time: 1407176973. State: :not-started, Heartbeat: nil
2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shutting down 48753f4c-e0fd-48f3-a149-1f52491da5b9:f620ab27-61fd-4b87-b017-dea1e811074b
2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shut down 48753f4c-e0fd-48f3-a149-1f52491da5b9:f620ab27-61fd-4b87-b017-dea1e811074b
2014-08-04 18:29:33 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "SALABPOSITION-5-1-2-1406938773", :executors ([3 3] [5 5] [7 7] [9 9] [11 11] [1 1])} for this supervisor 48753f4c-e0fd-48f3-a149-1f52491da5b9 on port 6703 with id c290b2ec-7969-44ca-ac3e-008b8841ef3f

And this process keeps repeating.

In the worker logs, I see the following:

2014-08-04 08:09:53 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [14]
2014-08-04 08:09:54 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [15]
2014-08-04 08:09:55 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [16]
......
2014-08-04 08:10:10 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-supervisor2.integral.com/192.168.239.166:6703
2014-08-04 08:10:10 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-supervisor2.integral.com/192.168.239.166:6703..., timeout: 600000ms, pendings: 0
2014-08-04 08:10:10 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-supervisor2.integral.com/192.168.239.166:6701... [0]
2014-08-04 08:10:10 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:662) [na:1.6.0_31]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
    at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:323) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    ... 6 common frames omitted
2014-08-04 08:10:10 b.s.util [INFO] Halting process: ("Async loop died!")

It seems that the supervisor is not able to communicate with the workers because of some Netty connection issues. I would appreciate it if somebody could help me in this regard.

Thanks,
Rushabh
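
The hostname and port checks suggested in the reply above can be scripted so that they are easy to repeat from every host in the cluster. The following is only a minimal Python sketch, not something from the thread itself: the function names check_port and check_cluster are hypothetical, and the lookup is limited to IPv4 on the assumption that the workers run with -Djava.net.preferIPv4Stack=true, as in the command line logged above.

```python
import socket


def check_port(host, port, timeout=2.0):
    """Resolve host and attempt a TCP connect; return (ip_or_None, reachable)."""
    try:
        # IPv4-only lookup (assumption: workers prefer the IPv4 stack)
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        # The name does not resolve from this machine
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Connection refused, timed out, or dropped by a firewall
        return ip, False


def check_cluster(hosts, ports, timeout=2.0):
    """Return {(host, port): (ip, reachable)} for every host/port pair."""
    return {(h, p): check_port(h, p, timeout) for h in hosts for p in ports}
```

Run from each supervisor host, a call such as check_cluster(["supervisor2.integral.com"], [6701, 6702, 6703]) (host and ports taken from the logs above) would show whether the name resolves to the expected address and whether each port accepts connections. A refused connection is only meaningful for a port where a worker is expected to be listening; an idle slot will also refuse.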
