Some more interesting info. As I mentioned in the last post, we have 6 supervisor nodes. Each node has 2 worker slots.
Only one of the supervisors received a state :timed-out message. After it received the message, it restarted both workers successfully. The other 5 supervisors all experienced the same issue as I mentioned in the last post: the workers were launched twice. What's interesting is that both workers on those 5 nodes came up the second time in a hung state; they all had a lot of ERROR messages from the netty client about not being able to connect to the other nodes:

2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700 failed: java.net.ConnectException: Connection refused: StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700

It looks like the bolts and spouts never got fully prepared (as evidenced by the lack of log messages indicating so). On the one supervisor which _did_ receive the timeout message, the 2 workers came up just fine.
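For what it's worth, one way to make the "fully prepared" check unambiguous is to log explicitly at the end of prepare() / open() in each component. A minimal sketch (Storm 0.10 / backtype.storm API; the class and log message are illustrative, not the actual topology code):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.util.Map;

    // Illustrative bolt -- the only point is the explicit "prepared" log line,
    // which makes it easy to grep a worker's log to confirm startup completed.
    public class ExampleBolt extends BaseRichBolt {
        private static final Logger LOG = LoggerFactory.getLogger(ExampleBolt.class);
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Include the task id and worker port so hung workers are easy to correlate.
            LOG.info("prepared task {} on worker port {}",
                    context.getThisTaskId(), context.getThisWorkerPort());
        }

        @Override
        public void execute(Tuple input) {
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams in this sketch
        }
    }

Grepping the worker logs for that line (or its absence) gives a direct answer to whether a relaunched worker ever finished preparing.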
On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <[email protected]> wrote:

> Unfortunately we're not capturing disk I/O in our metrics; I can look into doing that for next time.
>
> We're not capturing GC logs; we are using the Graphite storm metric consumer to push metrics to Graphite, one of which is the GC time from the default GC MXBean.
>
> I'm assuming you're saying that multiple workers had state :timed-out at once?
>
> We have 6 workers. Only one had state :timed-out, the others had state :disallowed. Looking at one of the supervisors as an example, it looks like it received multiple requests to reschedule the worker, which caused the worker to be launched multiple times. Is this normal?
>
> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and clearing state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[124 124] [64 64] [196 196] [40 40] [28 28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148 148] [136 136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}
>
> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>
> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118] [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>
> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>
> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701 with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>
> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]], :port 6700}
>
> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080, :storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]], :port 6701}
>
> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
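As a point of reference for the GC-time metric mentioned above: the default GC MXBeans expose cumulative collection counts and times per collector, which is the source a metrics consumer would report to Graphite. A small illustrative snippet (not the Graphite metrics consumer's actual code):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints cumulative GC counts and times from the JVM's default GC MXBeans.
    public class GcTimes {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Note these values are cumulative since JVM start, so a small delta between samples is what actually indicates little GC activity.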
> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <[email protected]> wrote:
>
>> Maybe disk I/O was high? Are you capturing GC logs to disk in unique files (you can sub the PID and timestamp into the GC log filename)? I know you believe it's not responsible, but it's the only thing I've ever found to be responsible thus far. (Except for a problem in storm 0.9.3 with netty that has since been fixed -- we worked around that by downgrading to ZeroMQ.) You might try monitoring the heartbeat files written by the workers to watch for the file creation happening less frequently than once per second.
>>
>> > all of the worker sessions expired at the same time
>>
>> I'm assuming you're saying that multiple workers had state :timed-out at once? Was that on the same host? If the state is :disallowed, that is perfectly normal when the reassignment happens, as I described earlier.
>>
>> - Erik
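A minimal sketch of the heartbeat-file check Erik describes, assuming the usual local-dir layout of <storm.local.dir>/workers/<worker-id>/heartbeats (the /var/storm path below is just a placeholder; verify the layout on your install):

    import java.io.File;

    // Reports how long ago each worker last refreshed its local heartbeat files.
    // Workers normally refresh roughly once per second; a large age here is what
    // leads the supervisor to mark the worker :timed-out.
    public class HeartbeatAge {
        public static void main(String[] args) {
            File workersDir = new File(args.length > 0 ? args[0] : "/var/storm/workers");
            long now = System.currentTimeMillis();
            File[] workers = workersDir.listFiles();
            if (workers == null) {
                System.err.println("no worker directories under " + workersDir);
                return;
            }
            for (File worker : workers) {
                if (!worker.isDirectory()) {
                    continue;
                }
                long newest = 0;
                File[] hbFiles = new File(worker, "heartbeats").listFiles();
                if (hbFiles != null) {
                    for (File f : hbFiles) {
                        newest = Math.max(newest, f.lastModified());
                    }
                }
                long ageSecs = (newest == 0) ? -1 : (now - newest) / 1000;
                System.out.printf("%s: last heartbeat %d seconds ago%n", worker.getName(), ageSecs);
            }
        }
    }

Running something like this in a loop (or simply watching the file mtimes) would show whether the heartbeat writes actually stall before the supervisor declares :timed-out.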
>> On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:
>>
>>> Any tips on where to continue investigating or other metrics to capture?
>>>
>>> As I mentioned before, the topology was mostly idle. Low CPU usage, low GC time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of the worker sessions expired at the same time.
>>>
>>> On Sunday, May 1, 2016, Erik Weathers <[email protected]> wrote:
>>>
>>>> To be clear, the supervisor wasn't shutting itself down, it was killing the worker process.
>>>>
>>>> Also for clarity, the :disallowed state simply means a new task->worker assignment has been computed by the nimbus, and so the supervisor has noticed from the ZK state that the currently running worker is no longer supposed to be running.
>>>>
>>>> Regarding the :timed-out state, it *definitely* means the worker was hung and couldn't write to the local heartbeat file within the timeout (30 secs by default). As for *why*... That is hard to determine.
>>>>
>>>> - Erik
>>>>
>>>> On Sunday, May 1, 2016, Kevin Conaway <[email protected]> wrote:
>>>>
>>>>> Thanks Erik. We're using Storm 0.10 so Pacemaker doesn't come into play here.
>>>>>
>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>> 4. Worker process hung (eg doing GC).
>>>>>
>>>>> I don't think it's either of these, TBH. There were no abnormal terminations in the logs (or in the Storm error logs). I have the GC metrics that were reported to Graphite and the GC time was fairly low (~50ms).
>>>>>
>>>>> My feeling is that it is Zookeeper related and not worker related, because it appears that ZK expired sessions from the supervisor nodes as well.
>>>>>
>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>
>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO [ProcessThread(sid:4 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x45453e198e80076
>>>>>
>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>
>>>>> And some corresponding log messages in one of the supervisor log files (for the same ZK session ID). It looks like this occurred right as the supervisor was shutting down due to Nimbus restarting it:
>>>>>
>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x45453e198e80076, likely server has closed socket, closing socket connection and attempting reconnect
>>>>>
>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>
>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for execution of cleanup threads on worker.
>>>>>
>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>>>>
>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server <zk host>:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>
>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to <zk host>:2181, initiating session
>>>>>
>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket connection
>>>>>
>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: LOST
>>>>>
>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session expired event received
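The timeouts in play in these logs are all configurable. A small sketch (a hypothetical helper, not something from the thread) that prints the relevant keys from the merged Storm config; in Storm 0.10 the defaults are storm.zookeeper.session.timeout = 20000 ms (matching the "timeout of 20000ms exceeded" expirations above) and 30 seconds for the worker/task timeouts Erik mentioned:

    import backtype.storm.Config;
    import backtype.storm.utils.Utils;

    import java.util.Map;

    // Prints the timeout settings relevant to this thread from the merged Storm
    // config (defaults.yaml overridden by storm.yaml). Illustrative only.
    public class ShowTimeouts {
        public static void main(String[] args) {
            Map conf = Utils.readStormConfig();
            // ZooKeeper session timeout used by workers/supervisors (default 20000 ms).
            System.out.println(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT + " = "
                    + conf.get(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT));
            // Local heartbeat-file timeout before the supervisor marks a worker
            // :timed-out (default 30 secs).
            System.out.println(Config.SUPERVISOR_WORKER_TIMEOUT_SECS + " = "
                    + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
            // ZK heartbeat timeout before nimbus considers an executor not alive
            // and computes a new assignment (default 30 secs).
            System.out.println(Config.NIMBUS_TASK_TIMEOUT_SECS + " = "
                    + conf.get(Config.NIMBUS_TASK_TIMEOUT_SECS));
        }
    }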
>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <[email protected]> wrote:
>>>>>
>>>>>> (I don't know anything about the pacemaker service introduced in storm 1.0, so this statement is pre-1.0.)
>>>>>>
>>>>>> The executor threads within the worker processes write heartbeats to ZooKeeper. If they aren't successfully heartbeating then it could be many things:
>>>>>>
>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>> 3. Worker process died due to exception (this is almost always what we see)
>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be caught by the supervisor on that host since it checks a local heartbeat file that the worker normally writes to every second -- if the heartbeat doesn't get refreshed before the timeout then the supervisor kills the worker process with State being :timed-out (or :time-out, something like that). This of course depends on the various timeout config values you have on the worker and nimbus hosts.)
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> On Saturday, April 30, 2016, Kevin Conaway <[email protected]> wrote:
>>>>>>
>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart our topology. From researching past threads it seems like this is related to not receiving heartbeats from the supervisors, but I'm unsure if this was the case. Our topology was mostly idle at the time that the restart was triggered.
>>>>>>>
>>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble. On one of the ZK nodes, I saw the following messages at the time of the restart:
>>>>>>>
>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>
>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not alive
>>>>>>>
>>>>>>> Shortly thereafter, the supervisors started restarting the workers. The following log message was in the supervisor log:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>
>>>>>>> Previous threads have suggested that this was due to heavy GC causing the heartbeats not to reach Zookeeper, but the topology was idle at this time so I don't think GC was the culprit. The GC ParNew time was about 50ms on each node (as reported to Graphite).
>>>>>>>
>>>>>>> Thoughts on what could have happened here and how to debug further?

--
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
