Are you sure the worker is writing heartbeats and the supervisor is reading
them?
On Thursday, April 16, 2015 1:36 PM, Grant Overby (groverby)
<[email protected]> wrote:
The supervisor is reporting the worker “still hasn’t started” even though the
worker is up and appears to be working. After timeout, the supervisor kills and
restarts the worker.
I set the timeouts really high (10 mins) to see what would happen. I though
maybe a GC pause or something. The worker is still killed with these large
timeouts, it just takes longer.
The supervisor is able to establish a connection to the worker
port:[root@twig07 storm]# netstat -pan | grep 6700tcp 0 0 :::6700
:::* LISTEN 6860/java
tcp 0 0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.8:44324
ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:36940
::ffff:10.0.1.8:6700 ESTABLISHED 6860/java tcp 0 256
::ffff:10.0.1.7:36543 ::ffff:10.0.1.6:6700 ESTABLISHED 6860/java
tcp 0 0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.5:38746
ESTABLISHED 6860/java tcp 0 0 ::ffff:10.0.1.7:59151
::ffff:10.0.1.5:6700 ESTABLISHED 6860/java tcp 0
0 ::ffff:10.0.1.7:6700 ::ffff:10.0.1.6:59916 ESTABLISHED
6860/java
What can I look at to diagnose this behavior?
Supervisor log:2015-04-15 18:32:52 b.s.d.supervisor [INFO]
8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't started2015-04-15 18:32:52
b.s.d.supervisor [INFO] 8e836654-931d-46d5-bf3f-cb75e394b113 still hasn't
started2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 8e836654-931d-46d5-bf3f-cb75e394b113. Current supervisor time:
1429138441. State: :disallowed, Heartbeat:
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429138440, :storm-id
"burner-trident-1-1429131611", :executors #{[4 4] [36 36] [68 68] [100 100] [8
8] [40 40] [72 72] [104 104] [12 12] [44 44] [76 76] [108 108] [16 16] [48 48]
[80 80] [112 112] [20 20] [52 52] [84 84] [116 116] [24 24] [56 56] [88 88]
[120 120] [28 28] [60 60] [92 92] [124 124] [-1 -1] [32 32] [64 64] [96 96]
[128 128]}, :port 6700}2015-04-15 18:54:01 b.s.d.supervisor [INFO] Shutting
down
c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15
18:54:01 b.s.config [INFO] GET worker-user
8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15 18:54:02 b.s.config [INFO]
REMOVE worker-user 8e836654-931d-46d5-bf3f-cb75e394b1132015-04-15 18:54:02
b.s.d.supervisor [INFO] Shut down
c60221c8-1c96-41b5-9669-4fa8b74fc022:8e836654-931d-46d5-bf3f-cb75e394b113
Worker log for the same time period:2015-04-15 18:43:16 k.p.SyncProducer [INFO]
Connected to twig06.twigs:6667 for producing2015-04-15 18:43:16
k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing2015-04-15
18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for
producing2015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from
twig07.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from
twig08.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from
twig05.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from
twig06.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Disconnecting from
twig07.twigs:66672015-04-15 18:43:16 k.p.SyncProducer [INFO] Connected to
twig07.twigs:6667 for producing2015-04-15 18:43:16 k.p.SyncProducer [INFO]
Connected to twig06.twigs:6667 for producing2015-04-15 18:43:16
k.p.SyncProducer [INFO] Connected to twig08.twigs:6667 for producing2015-04-15
18:43:16 k.p.SyncProducer [INFO] Connected to twig05.twigs:6667 for
producing2015-04-15 18:44:06 s.k.t.ZkBrokerReader [INFO] brokers need
refreshing because 60000ms have expired2015-04-15 18:44:06
s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper:
GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667,
1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}2015-04-15
18:45:14 s.k.t.ZkBrokerReader [INFO] brokers need refreshing because 60000ms
have expired2015-04-15 18:45:14 s.k.DynamicBrokersReader [INFO] Read partition
info from zookeeper:
GlobalPartitionInformation{partitionMap={0=twig07.twigs:6667,
1=twig08.twigs:6667, 2=twig05.twigs:6667, 3=twig06.twigs:6667}}2015-04-15
18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30).
Pinning to 292015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry
[INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries
[30]2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to
twig07.twigs, 6700, config: , buffer_size: 52428802015-04-15 18:54:01
b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-twig07.twigs/10.0.1.7:6700... [0]2015-04-15 18:54:01
b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-twig08.twigs/10.0.1.8:67002015-04-15 18:54:01 b.s.m.n.Client
[INFO] Waiting for pending batchs to be sent with
Netty-Client-twig08.twigs/10.0.1.8:6700..., timeout: 600000ms, pendings:
02015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a remote
host Netty-Client-twig07.twigs/10.0.1.7:6700, [id: 0x3a394433, /10.0.1.7:35479
=> twig07.twigs/10.0.1.7:6700]2015-04-15 18:54:01 b.s.d.worker [INFO] Shutting
down worker burner-trident-1-1429131611 c60221c8-1c96-41b5-9669-4fa8b74fc022
67002015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-twig07.twigs/10.0.1.7:67002015-04-15 18:54:01 b.s.m.n.Client
[INFO] Waiting for pending batchs to be sent with
Netty-Client-twig07.twigs/10.0.1.7:6700..., timeout: 600000ms, pendings:
02015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-twig06.twigs/10.0.1.6:67002015-04-15 18:54:01 b.s.m.n.Client
[INFO] Waiting for pending batchs to be sent with
Netty-Client-twig06.twigs/10.0.1.6:6700..., timeout: 600000ms, pendings:
02015-04-15 18:54:01 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-twig05.twigs/10.0.1.5:67002015-04-15 18:54:01 b.s.m.n.Client
[INFO] Waiting for pending batchs to be sent with
Netty-Client-twig05.twigs/10.0.1.5:6700..., timeout: 600000ms, pendings:
02015-04-15 18:54:01 b.s.d.worker [INFO] Shutting down receive thread2015-04-15
18:54:01 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too large (30).
Pinning to 292015-04-15 18:54:01 b.s.u.StormBoundedExponentialBackoffRetry
[INFO] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries
[30]2015-04-15 18:54:01 b.s.m.n.Client [INFO] New Netty Client, connect to
localhost, 6700, config: , buffer_size: 52428802015-04-15 18:54:01
b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-localhost/127.0.0.1:6700... [0]2015-04-15 18:54:01 b.s.m.loader
[INFO] Shutting down receiving-thread: [burner-trident-1-1429131611,
6700]2015-04-15 18:54:01 b.s.m.n.Client [INFO] connection established to a
remote host Netty-Client-localhost/127.0.0.1:6700, [id: 0x9e468dc6,
/127.0.0.1:59930 => localhost/127.0.0.1:6700]2015-04-15 18:54:01 b.s.m.n.Client
[INFO] Closing Netty Client Netty-Client-localhost/127.0.0.1:67002015-04-15
18:54:01 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with
Netty-Client-localhost/127.0.0.1:6700..., timeout: 600000ms, pendings:
02015-04-15 18:54:01 b.s.m.loader [INFO] Waiting for
receiving-thread:[burner-trident-1-1429131611, 6700] to die2015-04-15 18:54:01
b.s.m.loader [INFO] Shutdown receiving-thread: [burner-trident-1-1429131611,
6700]2015-04-15 18:54:01 b.s.d.worker [INFO] Shut down receive thread2015-04-15
18:54:01 b.s.d.worker [INFO] Terminating messaging context2015-04-15 18:54:01
b.s.d.worker [INFO] Shutting down executors2015-04-15 18:54:01 b.s.d.executor
[INFO] Shutting down executor __acker:[4 4]2015-04-15 18:54:01 b.s.util [INFO]
Async loop interrupted!2015-04-15 18:54:01 b.s.util [INFO] Async loop
interrupted!2015-04-15 18:54:01 b.s.d.executor [INFO] Shut down executor
__acker:[4 4]2015-04-15 18:54:01 b.s.d.executor [INFO] Shutting down executor
b-0-To-DC-Sink:[36 36]2015-04-15 18:54:01 b.s.util [INFO] Async loop
interrupted!2015-04-15 18:54:01 b.s.util [INFO] Async loop
interrupted!2015-04-15 18:54:01 c.c.t.b.t.f.FlushingFunction [ERROR] sleeping
in flushThread interruptedjava.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method) [na:1.7.0_75] at
com.cisco.tinderbox.burner.trident.functions.FlushingFunction$1.run(FlushingFunction.java:84)
~[stormjar.jar:na] at java.lang.Thread.run(Thread.java:745)
[na:1.7.0_75]
|
| |
| Grant Overby
Software Engineer
Cisco.com
[email protected]
Mobile: 865 724 4910
|
|
| |
| Think before you print. |
| This email may contain confidential and privileged material for the sole use
of the intended recipient. Any review, use, distribution or disclosure by
others is strictly prohibited. If you are not the intended recipient (or
authorized to receive for the recipient), please contact the sender by reply
email and delete all copies of this message.Please click here for Company
Registration Information. |
|