supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3

task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1

Some nimbus parameters:

nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30

When a worker dies, the log on one of the supervisors shows it shutting down
the worker with a state of disallowed (from what I found searching around, some
people say this is due to nimbus reassignment). The other logs only show the
worker being shut down, without any further information.
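
A rough sanity check on the settings above, as a minimal Python sketch rather
than Storm's actual code. It assumes each heartbeat frequency pairs with the
nimbus timeout of the same kind (task with task, supervisor with supervisor),
and that "alive" simply means the last heartbeat is no older than the timeout.
The helper names `missed_beats_tolerated` and `is_alive` are hypothetical.

```python
# Values copied from the settings listed above (all in seconds).
settings = {
    "task.heartbeat.frequency.secs": 3,
    "nimbus.task.timeout.secs": 30,
    "supervisor.heartbeat.frequency.secs": 5,
    "nimbus.supervisor.timeout.secs": 60,
}


def missed_beats_tolerated(frequency_secs, timeout_secs):
    """Number of whole heartbeat intervals that fit inside the timeout."""
    return timeout_secs // frequency_secs


def is_alive(last_heartbeat_secs, now_secs, timeout_secs):
    """Assumed liveness rule: alive if the last heartbeat is within the timeout."""
    return (now_secs - last_heartbeat_secs) <= timeout_secs


task_beats = missed_beats_tolerated(
    settings["task.heartbeat.frequency.secs"],
    settings["nimbus.task.timeout.secs"],
)
supervisor_beats = missed_beats_tolerated(
    settings["supervisor.heartbeat.frequency.secs"],
    settings["nimbus.supervisor.timeout.secs"],
)

print("task heartbeats per timeout window:", task_beats)
print("supervisor heartbeats per timeout window:", supervisor_beats)

# A process that stalls (for example in a long GC pause) for longer than the
# task timeout would be treated as dead under this assumed rule.
task_timeout = settings["nimbus.task.timeout.secs"]
print("alive after 29s of silence:", is_alive(0, 29, task_timeout))
print("alive after 35s of silence:", is_alive(0, 35, task_timeout))
```

Under these assumptions a task would have to miss many consecutive heartbeats
before nimbus gives up on it, so a single delayed heartbeat would not be enough
on its own.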


On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]>
wrote:

> I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
> like that).  Some clusters have 3 ZK hosts, some 5.
>
> The way the nimbus detects that the executors are not alive is by not
> seeing heartbeats updated in ZK.  There has to be some cause for the
> heartbeats not being updated.  The most likely cause is that the worker
> process is dead.  Another could be that the process is too busy garbage
> collecting, and so missed the timeout for updating the heartbeat.
>
> Regarding Supervisor and Worker: I think it's normal for the worker to be
> able to live absent the presence of the supervisor, so that sounds like
> expected behavior.
>
> What are your timeouts for the various heartbeats?
>
> Also, when the worker dies you should see a log from the supervisor
> noticing it.
>
> - Erik
>
> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>
>> Hi Erik
>>
>> Thanks for your reply!  It's great to hear about real production usages.
>> For our use case, we are really puzzled by the outcome so far. The initial
>> investigation seems to indicate that workers don't die by themselves (I
>> actually tried killing the supervisor, and the worker continued running
>> beyond 30 minutes).
>>
>> The sequence of events is like this:  the supervisor immediately complains
>> that the worker "still has not started" for a few seconds right after
>> launching the worker process, then goes silent --> after 26 minutes, nimbus
>> complains that executors (related to the worker) are "not alive" and starts
>> to reassign the topology --> after another ~500 milliseconds, the supervisor
>> shuts down its worker --> other peer workers complain about netty issues,
>> and the loop goes on.
>>
>> Could you kindly tell me what version of zookeeper is used with 0.9.4,
>> and how many nodes are in the zookeeper cluster?
>>
>> I wonder if this is due to zookeeper issues.
>>
>> Thanks a lot,
>> Fang
>>
>>
>>
>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
>> wrote:
>>
>>> Hey Fang,
>>>
>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>
>>> One of the challenges with storm is figuring out what the root cause is
>>> when things go haywire.  You'll wanna examine why the nimbus decided to
>>> restart your worker processes.  It would happen when workers die and the
>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>> looking at logs on the worker hosts.
>>>
>>> - Erik
>>>
>>>
>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>
>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>>> 0.9.5 yet but I don't see any significant differences there), and
>>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>>> cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes but on
>>>> different disks.
>>>>
>>>> I have had a lot of trouble getting my data analytics topology to run
>>>> stably. So I tried the simplest topology I can think of, just an empty
>>>> bolt, no I/O except for reading from a kafka queue.
>>>>
>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka
>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>> size=1k).
>>>> After 26 minutes, nimbus orders the topology to be killed as it believes
>>>> the topology is dead, then after another 2 minutes, another kill, then
>>>> another after another 4 minutes, and on and on.
>>>>
>>>> I can understand there might be issues in the coordination among
>>>> nimbus, worker and executor (e.g., heartbeats). But are there any practical
>>>> workarounds? I hope there are, as so many of you are using it in production
>>>> :-)
>>>>
>>>> I would deeply appreciate any suggestions that could get even my toy
>>>> topology working!
>>>>
>>>> Fang
>>>>
>>>>
>>
