I also believe something is going on but just can't find it out.

But I do observe in all my experiments, the first worker that started to
lose heartbeats is the one with kakfa spout task (I have only one spout
task). And when it happens, it seems like the whole worker process hangs,
none of the bolt tasks running in the same worker could spit out logging
messages.  I double-checked the spout creation process, there is nothing
suspicious since I don't really do anything except for using the
forceFromStart param.   This continues until the worker gets recycled by
supervisor after 600 seconds timeout.  Then I see another worker (on
another node) starts to lose heartbeats.

If I kill all supervisors, I can no longer observe the hang issue. It seems
as if the supervisor is also affecting the worker behavior (maybe by
reading heartbeats for zk??)

I am planning to re-arrange the cluster and see if that makes any
difference.




On Fri, Jun 12, 2015 at 11:52 PM, Erik Weathers <[email protected]>
wrote:

> There is something fundamentally wrong.  You need to get to the root cause
> of what the worker process is doing that is preventing the heartbeats from
> arriving.
>
> - Erik
>
>
> On Friday, June 12, 2015, Fang Chen <[email protected]> wrote:
>
>> I tuned up all worker timeout and task time out to 600 seconds, and seems
>> like nimbus is happy about it after running the topology for 40minutes. But
>> still one supervisor complained timeout from worker and then shut it down:
>>
>> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current
>> supervisor time: 1434153560. State: :timed-out, Heartbeat:
>> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959,
>> :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11]
>> [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>>
>>
>> It's so hard to believe that even 600 seconds is not enough.
>>
>> On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected]> wrote:
>>
>>> I turned on debug and seems like the nimbus reassign was indeed caused
>>> by heartbeat timeouts after running the topology for about 20 minutes. You
>>> can see that those non-live executors have a ":is-timed-out true"  status
>>> and executor reported time is about 100 second behind nimbus time, while
>>> other live executors have executor time head of nimbus time.
>>>
>>>
>>> ==========
>>>
>>> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true,
>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16]
>>> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time
>>> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true,
>>> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475,
>>> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false,
>>> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28]
>>> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time
>>> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355,
>>> :executor-reported-time 1434146250}}
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[6 6] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[11 11] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[16 16] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[21 21] not alive
>>> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[26 26] not alive
>>> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor
>>> asyncVarGenTopology-1-1434145217:[1 1] not alive
>>>
>>>
>>> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected]> wrote:
>>>
>>>> supervisor.heartbeat.frequency.secs 5
>>>> supervisor.monitor.frequency.secs 3
>>>>
>>>> task.heartbeat.frequency.secs 3
>>>> worker.heartbeat.frequency.secs 1
>>>>
>>>> some nimbus parameters:
>>>>
>>>> nimbus.monitor.freq.secs 120
>>>> nimbus.reassign true
>>>> nimbus.supervisor.timeout.secs 60
>>>> nimbus.task.launch.secs 120
>>>> nimbus.task.timeout.secs 30
>>>>
>>>> When worker dies, the log in one of supervisors shows shutting down
>>>> worker with  state of disallowed (which I googled around and some people
>>>> say it's due to nimbus reassign). Other logs only show shutting down worker
>>>> without any further information.
>>>>
>>>>
>>>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]>
>>>> wrote:
>>>>
>>>>> I'll have to look later, I think we are using ZooKeeper v3.3.6
>>>>> (something like that).  Some clusters have 3 ZK hosts, some 5.
>>>>>
>>>>> The way the nimbus detects that the executors are not alive is by not
>>>>> seeing heartbeats updated in ZK.  There has to be some cause for the
>>>>> heartbeats not being updated.  Most likely one is that the worker
>>>>> process is dead.  Another one could be that the process is too busy 
>>>>> Garbage
>>>>> Collecting, and so missed the timeout for updating the heartbeat.
>>>>>
>>>>> Regarding Supervisor and Worker: I think it's normal for the worker to
>>>>> be able to live absent the presence of the supervisor, so that sounds like
>>>>> expected behavior.
>>>>>
>>>>> What are your timeouts for the various heartbeats?
>>>>>
>>>>> Also, when the worker dies you should see a log from the supervisor
>>>>> noticing it.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>
>>>>>> Hi Erik
>>>>>>
>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>> usages. For our use case, we are really puzzled by the outcome so far. 
>>>>>> The
>>>>>> initial investigation seems to indicate that workers don't die by
>>>>>> themselves ( i actually tried killing the supervisor and the worker would
>>>>>> continue running beyond 30 minutes).
>>>>>>
>>>>>> The sequence of events is like this:  supervisor immediately
>>>>>> complains worker "still has not started" for a few seconds right after
>>>>>> launching the worker process, then silent --> after 26 minutes, nimbus
>>>>>> complains executors (related to the worker) "not alive" and started to
>>>>>> reassign topology --> after another ~500 milliseconds, the supervisor 
>>>>>> shuts
>>>>>> down its worker --> other peer workers complain about netty issues. and 
>>>>>> the
>>>>>> loop goes on.
>>>>>>
>>>>>> Could you kindly tell me what version of zookeeper is used with
>>>>>> 0.9.4? and how many nodes in the zookeeper cluster?
>>>>>>
>>>>>> I wonder if this is due to zookeeper issues.
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Fang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hey Fang,
>>>>>>>
>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>> nodes.
>>>>>>>
>>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided 
>>>>>>> to
>>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>> looking at logs on the worker hosts.
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>
>>>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not
>>>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and
>>>>>>>> unfortunately we could not even have a clean run for over 30 minutes 
>>>>>>>> on a
>>>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes 
>>>>>>>> but on
>>>>>>>> different disks.
>>>>>>>>
>>>>>>>> I have huge troubles to give my data analytics topology a stable
>>>>>>>> run. So I tried the simplest topology I can think of, just an emtpy 
>>>>>>>> bolt,
>>>>>>>> no io except for reading from kafka queue.
>>>>>>>>
>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field 
>>>>>>>> grouping,
>>>>>>>> msg size=1k).
>>>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe
>>>>>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>>>>>> another after another 4 minutes, and on and on.
>>>>>>>>
>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any 
>>>>>>>> doable
>>>>>>>> workarounds? I wish there are as so many of you are using it in 
>>>>>>>> production
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> I deeply appreciate any suggestions that could even make my toy
>>>>>>>> topology working!
>>>>>>>>
>>>>>>>> Fang
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>>

Reply via email to