Some more interesting info. As I mentioned in the last post, we have 6 supervisor nodes. Each node has 2 worker slots.
Only one of the supervisors received a state :timed-out message. After it received the message, it restarted both workers successfully. The other 5 supervisors all experienced the same issue as I mentioned in the last post: the workers were launched twice. What's interesting is that both workers on those 5 nodes came up the second time in a hung state; they all had a lot of ERROR messages from the netty client about not being able to connect to the other nodes:

2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 failed: java.net.ConnectException: Connection refused: StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700

It looks like the bolts and spouts never got fully prepared (as evidenced by the lack of log messages indicating so). On the one supervisor which _did_ receive the timeout message, the 2 workers came up just fine.
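For what it's worth, one way to make the "fully prepared" check unambiguous is to log explicitly at the end of prepare() / open() in each component. A minimal sketch (Storm 0.10 / backtype.storm API; the class and log message are illustrative, not the actual topology code):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.Map;

    // Illustrative bolt -- the only point is the explicit "prepared" log line,
    // which makes it easy to grep a worker's log to confirm startup completed.
    public class ExampleBolt extends BaseRichBolt {
        private static final Logger LOG = LoggerFactory.getLogger(ExampleBolt.class);
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Include the task id and worker port so hung workers are easy to correlate.
            LOG.info("prepared task {} on worker port {}",
                    context.getThisTaskId(), context.getThisWorkerPort());
        }

        @Override
        public void execute(Tuple input) {
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams in this sketch
        }
    }

Grepping the worker logs for that line (or its absence) gives a direct answer to whether a relaunched worker ever finished preparing.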
On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <[email protected]> wrote:

> Unfortunately we're not capturing disk I/O in our metrics; I can look into doing that for next time.
>
> We're not capturing GC logs; we are using the Graphite storm metric consumer to push metrics to Graphite, one of which is the GC time from the default GC MXBean.
>
> I'm assuming you're saying that multiple workers had state :timed-out at once?
>
> We have 6 workers. Only one had state :timed-out, the others had state :disallowed. Looking at one of the supervisors as an example, it looks like it received multiple requests to reschedule the worker, which caused the worker to be launched multiple times. Is this normal?
>
> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and clearing state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[124 124] [64 64] [196 196] [40 40] [28 28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148 148] [136 136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}
>
> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>
> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118] [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>
> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>
> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701 with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>
> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]], :port 6700}
>
> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]], :port 6701}
>
> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
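As a point of reference for the GC-time metric mentioned above: the default GC MXBeans expose cumulative collection counts and times per collector, which is the source a metrics consumer would report to Graphite. A small illustrative snippet (not the Graphite metrics consumer's actual code):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints cumulative GC counts and times from the JVM's default GC MXBeans.
    public class GcTimes {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Note these values are cumulative since JVM start, so a small delta between samples is what actually indicates little GC activity.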
> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <[email protected]> wrote:
>
>> Maybe disk I/O was high? Are you capturing GC logs to disk in unique files (you can sub the PID and timestamp into the GC log filename)? I know you believe it's not responsible, but it's the only thing I've ever found to be responsible thus far. (Except for a problem in storm 0.9.3 with netty that has since been fixed -- we worked around that by downgrading to ZeroMQ.) You might try monitoring the heartbeat files written by the workers to watch for the file creation happening less frequently than once per second.
>>
>> > all of the worker sessions expired at the same time
>>
>> I'm assuming you're saying that multiple workers had state :timed-out at once? Was that on the same host? If the state is :disallowed, that is perfectly normal when the reassignment happens, as I described earlier.
>>
>> - Erik
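A minimal sketch of the heartbeat-file check Erik describes, assuming the usual local-dir layout of <storm.local.dir>/workers/<worker-id>/heartbeats (the /var/storm path below is just a placeholder; verify the layout on your install):

    import java.io.File;

    // Reports how long ago each worker last refreshed its local heartbeat files.
    // Workers normally refresh roughly once per second; a large age here is what
    // leads the supervisor to mark the worker :timed-out.
    public class HeartbeatAge {
        public static void main(String[] args) {
            File workersDir = new File(args.length > 0 ? args[0] : "/var/storm/workers");
            long now = System.currentTimeMillis();
            File[] workers = workersDir.listFiles();
            if (workers == null) {
                System.err.println("no worker directories under " + workersDir);
                return;
            }
            for (File worker : workers) {
                if (!worker.isDirectory()) {
                    continue;
                }
                long newest = 0;
                File[] hbFiles = new File(worker, "heartbeats").listFiles();
                if (hbFiles != null) {
                    for (File f : hbFiles) {
                        newest = Math.max(newest, f.lastModified());
                    }
                }
                long ageSecs = (newest == 0) ? -1 : (now - newest) / 1000;
                System.out.printf("%s: last heartbeat %d seconds ago%n", worker.getName(), ageSecs);
            }
        }
    }

Running something like this in a loop (or simply watching the file mtimes) would show whether the heartbeat writes actually stall before the supervisor declares :timed-out.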
>> On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:
>>
>>> Any tips on where to continue investigating or other metrics to capture?
>>>
>>> As I mentioned before, the topology was mostly idle. Low CPU usage, low GC time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of the worker sessions expired at the same time.
>>>
>>> On Sunday, May 1, 2016, Erik Weathers <[email protected]> wrote:
>>>
>>>> To be clear, the supervisor wasn't shutting itself down, it was killing the worker process.
>>>>
>>>> Also for clarity, the :disallowed state simply means a new task->worker assignment has been computed by the nimbus, and so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running.
>>>>
>>>> Regarding the :timed-out state, it *definitely* means the worker was hung and couldn't write to the local heartbeat file within the timeout (30 secs by default). As for *why*... That is hard to determine.
>>>>
>>>> - Erik
>>>>
>>>> On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:
>>>>
>>>>> Thanks Erik. We're using Storm 0.10 so Pacemaker doesn't come into play here.
>>>>>
>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>> 4. Worker process hung (eg doing GC).
>>>>>
>>>>> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite and the GC time was fairly low (~50ms).
>>>>>
>>>>> My feeling is that it is Zookeeper related and not worker related, because it appears that ZK expired sessions from the supervisor nodes as well.
>>>>>
>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>
>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
>>>>>
>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>
>>>>> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>>>>>
>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
>>>>>
>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>
>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
>>>>>
>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>>>>
>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>
>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
>>>>>
>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
>>>>>
>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
>>>>>
>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
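The timeouts in play in these logs are all configurable. A small sketch (a hypothetical helper, not something from the thread) that prints the relevant keys from the merged Storm config; in Storm 0.10 the defaults are storm.zookeeper.session.timeout = 20000 ms (matching the "timeout of 20000ms exceeded" expirations above) and 30 seconds for the worker/task timeouts Erik mentioned:

    import backtype.storm.Config;
    import backtype.storm.utils.Utils;

    import java.util.Map;

    // Prints the timeout settings relevant to this thread from the merged Storm
    // config (defaults.yaml overridden by storm.yaml). Illustrative only.
    public class ShowTimeouts {
        public static void main(String[] args) {
            Map conf = Utils.readStormConfig();
            // ZooKeeper session timeout used by workers/supervisors (default 20000 ms).
            System.out.println(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT + " = "
                    + conf.get(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT));
            // Local heartbeat-file timeout before the supervisor marks a worker
            // :timed-out (default 30 secs).
            System.out.println(Config.SUPERVISOR_WORKER_TIMEOUT_SECS + " = "
                    + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
            // ZK heartbeat timeout before nimbus considers an executor not alive
            // and computes a new assignment (default 30 secs).
            System.out.println(Config.NIMBUS_TASK_TIMEOUT_SECS + " = "
                    + conf.get(Config.NIMBUS_TASK_TIMEOUT_SECS));
        }
    }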
>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <[email protected]> wrote:
>>>>>
>>>>>> (I don't know anything about the pacemaker service introduced in storm 1.0, so this statement is pre-1.0.)
>>>>>>
>>>>>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating then it could be many things:
>>>>>>
>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be caught by the supervisor on that host since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout then the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Saturday, April 30, 2016, Kevin Conaway <[email protected]> wrote:
>>>>>>
>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our topology. From researching past threads it seems like this is related to not receiving heartbeats from the supervisors, but I'm unsure if this was the case. Our topology was mostly idle at the time that the restart was triggered.
>>>>>>>
>>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>>>>>
>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>
>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>>>>>
>>>>>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>
>>>>>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach Zookeeper, but the topology was idle at this time so I don't think GC was the culprit. The GC ParNew time was about 50ms on each node (as reported to Graphite).
>>>>>>>
>>>>>>> Thoughts on what could have happened here and how to debug further?

--
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
