The supervisor keeps reporting that the worker “still hasn’t started” even though the 
worker process is up and appears to be doing work. Once the timeout expires, the 
supervisor kills and restarts the worker.

I set the timeouts really high (10 minutes) to see what would happen; I thought it 
might be a GC pause or something similar. The worker is still killed even with these 
large timeouts, it just takes longer.

The supervisor is able to establish a connection to the worker port:
[root@twig07 storm]# netstat -pan | grep 6700
tcp        0      0 :::6700                     :::*                        LISTEN      6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.8:44324       ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:36940       ::ffff:10.0.1.8:6700        ESTABLISHED 6860/java
tcp        0    256 ::ffff:10.0.1.7:36543       ::ffff:10.0.1.6:6700        ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.5:38746       ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:59151       ::ffff:10.0.1.5:6700        ESTABLISHED 6860/java
tcp        0      0 ::ffff:10.0.1.7:6700        ::ffff:10.0.1.6:59916       ESTABLISHED 6860/java



What can I look at to diagnose this behavior?
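
For example, is the worker's on-disk heartbeat state the right place to dig? If I 
understand the local-state layout correctly (an assumption on my part), that would be 
something like:

    ls -l <storm.local.dir>/workers/8e836654-931d-46d5-bf3f-cb75e394b113/heartbeats

or is the supervisor's view of the heartbeat coming from somewhere else?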





Supervisor log:
2015-04-15 18:32:52 b.s.d.supervisor [INFO] 
8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started
2015-04-15 18:32:52 b.s.d.supervisor [INFO] 
8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started
2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down and clearing state 
for id 8e836654-931d-46d5-bf3f-cb75e394b113. Current supervisor time: 
1429138441. State: :disallowed, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429138440, :storm-id 
"burner-trident-1-1429131611", :executors #{[4 4] [36 36] [68 68] [100 100] [8 
8] [40 40] [72 72] [104 104] [12 12] [44 44] [76 76] [108 108] [16 16] [48 48] 
[80 80] [112 112] [20 20] [52 52] [84 84] [116 116] [24 24] [56 56] [88 88] 
[120 120] [28 28] [60 60] [92 92] [124 124] [-1 -1] [32 32] [64 64] [96 96] 
[128 128]}, :port 6700}
2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down 
c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:01 b.s.config [INFO] GET worker-user 
8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:02 b.s.config [INFO] REMOVE worker-user 
8e836654-931d-46d5-bf3f-cb75e394b113
2015-04-15 18:54:02 b.s.d.supervisor [INFO] Shut down 
c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113



Worker log for the same time period:
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig08.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig05.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig06.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from twig07.twigs:6667
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig07.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig06.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for 
producing
2015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for 
producing
2015-04-15 18:44:06 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 
60000ms have expired
2015-04-15 18:44:06 s.k.DynamicBrokersReader [INFO] Read partition info from 
zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 
1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}
2015-04-15 18:45:14 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 
60000ms have expired
2015-04-15 18:45:14 s.k.DynamicBrokersReader [INFO] Read partition info from 
zookeeper: GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667, 
1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}
2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too 
large (30). Pinning to 29
2015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The 
baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to 
twig07.twigs, 6700, config: , buffer_size: 5242880
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-twig07.twigs/10.0.1.7:6700... [0]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-twig08.twigs/10.0.1.8:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent 
with Netty-Client-twig08.twigs/10.0.1.8:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote 
host Netty-Client-twig07.twigs/10.0.1.7:6700, [id: 0x3a394433, /10.0.1.7:35479 
=> twig07.twigs/10.0.1.7:6700]
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down worker 
burner-trident-1-1429131611 c60221c8-1c96-41b5-9669-4fa8b74fc022 6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-twig07.twigs/10.0.1.7:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent 
with Netty-Client-twig07.twigs/10.0.1.7:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-twig06.twigs/10.0.1.6:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent 
with Netty-Client-twig06.twigs/10.0.1.6:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-twig05.twigs/10.0.1.5:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent 
with Netty-Client-twig05.twigs/10.0.1.5:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down receive thread
2015-04-15 18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too 
large (30). Pinning to 29
2015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The 
baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [30]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to 
localhost, 6700, config: , buffer_size: 5242880
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-localhost/127.0.0.1:6700... [0]
2015-04-15 18:54:01 b.s.m.loader [INFO] Shutting down receiving-thread: 
[burner-trident-1-1429131611, 6700]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote 
host Netty-Client-localhost/127.0.0.1:6700, [id: 0x9e468dc6, /127.0.0.1:59930 
=> localhost/127.0.0.1:6700]
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-localhost/127.0.0.1:6700
2015-04-15 18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent 
with Netty-Client-localhost/127.0.0.1:6700..., timeout: 600000ms, pendings: 0
2015-04-15 18:54:01 b.s.m.loader [INFO] Waiting for 
receiving-thread:[burner-trident-1-1429131611, 6700] to die
2015-04-15 18:54:01 b.s.m.loader [INFO] Shutdown receiving-thread: 
[burner-trident-1-1429131611, 6700]
2015-04-15 18:54:01 b.s.d.worker [INFO] Shut down receive thread
2015-04-15 18:54:01 b.s.d.worker [INFO] Terminating messaging context
2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down executors
2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor __acker:[4 4]
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.d.executor [INFO] Shut down executor __acker:[4 4]
2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor 
b-0-To-DC-Sink:[36 36]
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 b.s.util [INFO] Async loop interrupted!
2015-04-15 18:54:01 c.c.t.b.t.f.FlushingFunction [ERROR] sleeping in 
flushThread interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method) [na:1.7.0_75]
        at 
com.cisco.tinderbox.burner.trident.functions.FlushingFunction$1.run(FlushingFunction.java:84)
 ~[stormjar.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]


Grant Overby
Software Engineer
Cisco.com
[email protected]
Mobile: 865 724 4910





