I would double-check that hostname resolution is working properly on all 
hosts in the cluster, and that no firewall rules are blocking connections 
on the supervisor ports.
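
A minimal sketch of that check, to be run from each host in the cluster. It only uses the Python standard library; the function name `check_host` and the timeout value are illustrative assumptions, not part of Storm.

```python
import socket


def check_host(host, ports, timeout=3.0):
    """Resolve `host`, then try a TCP connection to each port.

    Returns (ip, results), where ip is None if resolution failed and
    results maps each port to True (connected) or False (not reachable).
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        # Hostname did not resolve, so no port can be reached by name.
        return None, {port: False for port in ports}

    results = {}
    for port in ports:
        try:
            # A successful connect means something is listening and no
            # firewall dropped or rejected the connection.
            with socket.create_connection((ip, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            results[port] = False
    return ip, results
```

For the host and ports that appear in the logs quoted below, a call might look like `check_host("supervisor2.integral.com", [6701, 6702, 6703])`. A port only accepts connections while a worker is listening on it, so a refused connection on an idle slot is not by itself a sign of a firewall problem; a failed resolution or a timeout is more telling.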

I would also remove any Netty configuration overrides from storm.yaml so 
that the defaults take effect. Only override the defaults when you need to.
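
One way to spot such overrides is to list the uncommented lines in storm.yaml that set a key starting with `storm.messaging.netty.`. The sketch below assumes that prefix covers the settings in question and that storm.yaml uses simple top-level `key: value` lines; the function name `find_netty_overrides` is made up for illustration.

```python
def find_netty_overrides(yaml_text):
    """Return the uncommented lines that set a storm.messaging.netty.* key."""
    overrides = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith("storm.messaging.netty."):
            overrides.append(stripped)
    return overrides
```

Reading the file and passing its contents in, for example `find_netty_overrides(open("conf/storm.yaml").read())`, lists the lines to review. The path shown is an assumption about the install layout.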

- Taylor


On Aug 4, 2014, at 2:37 PM, Rushabh Shah <[email protected]> wrote:

> Hi,
> I have a topology that was deployed on a storm cluster and was running fine 
> until I started facing the following issue.
>  
> I can see that in supervisor logs, the supervisor is trying to launch the 
> topology on a worker but it is not able to start it.
>  
>  
>  
> 2014-08-04 18:27:33 b.s.d.supervisor [INFO] Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "SALABPOSITION-5-1-2-1406938773", :executors ([3 3] [5 5] [7 7] [9 9] [11 11] 
> [1 1])} for this supervisor 48753f4c-e0fd-48f3-a149-1f52491da5b9 on port 6702 
> with id f620ab27-61fd-4b87-b017-dea1e811074b
> 2014-08-04 18:27:33 b.s.d.supervisor [INFO] Launching worker with command: 
> '/integral/opt/jdk16/bin/java' '-server' '-Xmx768m' 
> '-Djava.net.preferIPv4Stack=true' '-Djava.net.preferIPv4Stack=true' 
> '-Xmanagement:ssl=false,authenticate=false,port=7099' '-Xmx8192m' 
> '-Djava.library.path=/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/resources/Linux-amd64:/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/resources:/usr/local/lib:/opt/local/lib:/usr/lib'
>  '-Dlogfile.name=worker-6702.log' 
> '-Dstorm.home=/integral/opt/apache-storm-0.9.2-incubating' 
> '-Dlogback.configurationFile=/integral/opt/apache-storm-0.9.2-incubating/logback/cluster.xml'
>  '-Dstorm.id=SALABPOSITION-5-1-2-1406938773' 
> '-Dworker.id=f620ab27-61fd-4b87-b017-dea1e811074b' '-Dworker.port=6702' '-cp' 
> '/integral/opt/apache-storm-0.9.2-incubating/lib/ring-devel-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/servlet-api-2.5-20081211.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/compojure-1.1.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.cli-0.2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/joda-time-2.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/carbonite-1.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.macro-0.1.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clj-time-0.4.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-codec-1.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-fileupload-1.2.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/httpclient-4.3.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/asm-4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/logback-classic-1.0.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jetty-6.1.26.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-jetty-adapter-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/netty-3.2.2.Final.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/slf4j-api-1.6.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/guava-13.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/objenesis-1.2.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/kryo-2.21.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/httpcore-4.3.2.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/zookeeper-3.4.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/logback-core-1.0.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jgrapht-core-0.9.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/curator-client-2.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-lang-2.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/snakeyaml-1.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clj-stacktrace-0.2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/minlog-1.2.jar
:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-logging-1.1.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/disruptor-2.10.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/log4j-over-slf4j-1.6.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/curator-framework-2.4.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jline-2.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-exec-1.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/core.incubator-0.1.0.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/json-simple-1.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/hiccup-0.3.6.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clojure-1.5.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/reflectasm-1.07-shaded.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/chill-java-0.3.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/commons-io-2.4.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/clout-1.0.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/servlet-api-2.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/tools.logging-0.2.3.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-core-1.1.5.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/netty-3.6.3.Final.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/math.numeric-tower-0.0.1.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/jetty-util-6.1.26.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/ring-servlet-0.3.11.jar:/integral/opt/apache-storm-0.9.2-incubating/lib/storm-core-0.9.2-incubating.jar:/integral/opt/apache-storm-0.9.2-incubating/conf:/app/storm/supervisor/stormdist/SALABPOSITION-5-1-2-1406938773/stormjar.jar'
>  'backtype.storm.daemon.worker' 'SALABPOSITION-5-1-2-1406938773' 
> '48753f4c-e0fd-48f3-a149-1f52491da5b9' '6702' 
> 'f620ab27-61fd-4b87-b017-dea1e811074b'
> 2014-08-04 18:27:33 b.s.d.supervisor [INFO] 
> f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
> 2014-08-04 18:27:33 b.s.d.supervisor [INFO] 
> f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
> 2014-08-04 18:27:34 b.s.d.supervisor [INFO] 
> f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
> …..
>  
> After 120 seconds the supervisor times out and tries to start the topology 
> on another worker.
>  
>  
> 2014-08-04 18:29:32 b.s.d.supervisor [INFO] 
> f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
> 2014-08-04 18:29:32 b.s.d.supervisor [INFO] 
> f620ab27-61fd-4b87-b017-dea1e811074b still hasn't started
> 2014-08-04 18:29:33 b.s.d.supervisor [INFO] Worker 
> f620ab27-61fd-4b87-b017-dea1e811074b failed to start
> 2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shutting down and clearing state 
> for id f620ab27-61fd-4b87-b017-dea1e811074b. Current supervisor time: 
> 1407176973. State: :not-started, Heartbeat: nil
> 2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shutting down 
> 48753f4c-e0fd-48f3-a149-1f52491da5b9:f620ab27-61fd-4b87-b017-dea1e811074b
> 2014-08-04 18:29:33 b.s.d.supervisor [INFO] Shut down 
> 48753f4c-e0fd-48f3-a149-1f52491da5b9:f620ab27-61fd-4b87-b017-dea1e811074b
> 2014-08-04 18:29:33 b.s.d.supervisor [INFO] Launching worker with assignment 
> #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id 
> "SALABPOSITION-5-1-2-1406938773", :executors ([3 3] [5 5] [7 7] [9 9] [11 11] 
> [1 1])} for this supervisor 48753f4c-e0fd-48f3-a149-1f52491da5b9 on port 6703 
> with id c290b2ec-7969-44ca-ac3e-008b8841ef3f
>  
>  
> And this process keeps on repeating.
>  
>  
> In the worker logs, I see the following:
>  
>  
> 2014-08-04 08:09:53 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [14]
> 2014-08-04 08:09:54 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [15]
> 2014-08-04 08:09:55 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-supervisor2.integral.com/192.168.239.166:6703... [16]
> ……
> 2014-08-04 08:10:10 b.s.m.n.Client [INFO] Closing Netty Client 
> Netty-Client-supervisor2.integral.com/192.168.239.166:6703
> 2014-08-04 08:10:10 b.s.m.n.Client [INFO] Waiting for pending batchs to be 
> sent with Netty-Client-supervisor2.integral.com/192.168.239.166:6703..., 
> timeout: 600000ms, pendings: 0
> 2014-08-04 08:10:10 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-supervisor2.integral.com/192.168.239.166:6701... [0]
> 2014-08-04 08:10:10 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being 
> closed, and does not take requests any more
>         at 
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>         at java.lang.Thread.run(Thread.java:662) [na:1.6.0_31]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not 
> take requests any more
>         at backtype.storm.messaging.netty.Client.send(Client.java:194) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:323)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         at 
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>         ... 6 common frames omitted
> 2014-08-04 08:10:10 b.s.util [INFO] Halting process: ("Async loop died!")
>  
>  
> It seems that the supervisor is not able to communicate with the workers 
> because of some netty connection issues.
>  
> I would appreciate it if somebody could help me with this.
>  
> Thanks,
> Rushabh
