Maybe disk I/O was high? Are you capturing GC logs to disk in unique files? (You can substitute the PID and a timestamp into the GC log filename.) I know you believe GC isn't responsible, but it's the only thing I've ever found to be responsible thus far. (Except for a problem in Storm 0.9.3 with Netty that has since been fixed -- we worked around that by downgrading to ZeroMQ.) You might also try monitoring the heartbeat files written by the workers, to watch for file creation happening less frequently than once per second.
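To make the GC-log idea concrete, here's a sketch (the log path is made up, and %p/%t filename expansion in -Xloggc needs a reasonably recent HotSpot JVM, so check yours):

    worker.childopts: "-Xloggc:/var/log/storm/gc-%ID%-%p-%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

The supervisor substitutes %ID% with the worker's port, and the JVM fills in %p (PID) and %t (startup timestamp), so each worker launch gets its own file instead of every restart clobbering the previous log.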
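And for the heartbeat files, even something as crude as the following would show whether the writes ever stall (a hypothetical watcher -- the directory argument is my assumption based on the default storm.local.dir layout, i.e. workers/<worker-id>/heartbeats):

    // HeartbeatWatch.java -- poll a worker's local heartbeat directory and
    // complain whenever the newest file hasn't been touched for ~2+ seconds.
    import java.io.File;

    public class HeartbeatWatch {
        public static void main(String[] args) throws InterruptedException {
            // e.g. <storm.local.dir>/workers/<worker-id>/heartbeats
            File dir = new File(args[0]);
            while (true) {
                long newest = 0L;
                File[] files = dir.listFiles();
                if (files != null) {
                    for (File f : files) {
                        newest = Math.max(newest, f.lastModified());
                    }
                }
                // A healthy worker refreshes its heartbeat every second.
                long ageMs = System.currentTimeMillis() - newest;
                if (ageMs > 2000) {
                    System.out.println("heartbeat stale for " + ageMs + " ms");
                }
                Thread.sleep(1000);
            }
        }
    }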
> all of the worker sessions expired at the same time

I'm assuming you're saying that multiple workers had state :timed-out at once? Was that on the same host? If the state is :disallowed, that is perfectly normal when a reassignment happens, as I described earlier.

- Erik

On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:

> Any tips on where to continue investigating, or other metrics to capture?
>
> As I mentioned before, the topology was mostly idle: low CPU usage, low GC time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of the worker sessions expired at the same time.
>
> On Sunday, May 1, 2016, Erik Weathers <[email protected]> wrote:
>
>> To be clear, the supervisor wasn't shutting itself down, it was killing the worker process.
>>
>> Also for clarity, the :disallowed state simply means that a new task->worker assignment has been computed by the nimbus, and so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running.
>>
>> Regarding the :timed-out state, it *definitely* means the worker was hung and couldn't write to the local heartbeat file within the timeout (30 secs by default). As for *why*... that is hard to determine.
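>>
>> For reference, the knobs involved (names are from Storm's defaults.yaml as of 0.10, and the values shown are the defaults -- verify against your install):
>>
>>     supervisor.worker.timeout.secs: 30      # local heartbeat file timeout
>>     storm.zookeeper.session.timeout: 20000  # ZK session timeout, in ms
>>     nimbus.task.timeout.secs: 30            # executor heartbeats in ZK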
>>
>> - Erik
>>
>> On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:
>>
>>> Thanks Erik. We're using Storm 0.10, so Pacemaker doesn't come into play here.
>>>
>>> 3. Worker process died due to exception (this is almost always what we see)
>>> 4. Worker process hung (e.g. doing GC).
>>>
>>> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite, and the GC time was fairly low (~50ms).
>>>
>>> My feeling is that it is ZooKeeper-related rather than worker-related, because it appears that ZK also expired sessions from the supervisor nodes as well.
>>>
>>> I saw the following messages in the log of one of the ZK nodes:
>>>
>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
>>> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>
>>> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>>>
>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
>>>
>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <[email protected]> wrote:
>>>
>>>> (I don't know anything about the Pacemaker service introduced in Storm 1.0, so this advice is pre-1.0.)
>>>>
>>>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating, it could be many things:
>>>>
>>>> 1. ZK too busy? (Seems unlikely -- see the quick check below the list)
>>>> 2. Network too busy? (Seems unlikely)
>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>> 4. Worker process hung (e.g. doing GC). (This would usually first be caught by the supervisor on that host, since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout, then the supervisor kills the worker process with State being :timed-out. This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
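>>>>
>>>> (A quick check for #1: ZooKeeper's four-letter words, e.g. "echo mntr | nc <zk host> 2181" or "echo stat | nc <zk host> 2181", will dump latency and outstanding-request stats from a 3.4.x server.)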
>>>>
>>>> - Erik
>>>>
>>>> On Saturday, April 30, 2016, Kevin Conaway <[email protected]> wrote:
>>>>
>>>>> We are using Storm 0.10, and we noticed that Nimbus decided to restart our topology. From researching past threads, it seems like this is related to not receiving heartbeats from the supervisors, but I'm unsure whether that was the case. Our topology was mostly idle at the time the restart was triggered.
>>>>>
>>>>> We have a 5-node ZooKeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>>>
>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>
>>>>> In the Nimbus log, there was the following message:
>>>>>
>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>>>
>>>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>>>
>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>
>>>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach ZooKeeper, but the topology was idle at this time, so I don't think GC was the culprit. The GC ParNew time was about 50ms on each node (as reported to Graphite).
>>>>>
>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>
>>>>> --
>>>>> Kevin Conaway
>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>> https://github.com/kevinconaway
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
