Make sure your topology is starting up in the allotted time, and if not, try
increasing the startup timeout.
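For reference, these are the knobs I mean, set in storm.yaml. This is a sketch assuming Storm 0.9.x; the exact defaults vary by release, so double-check conf/defaults.yaml for your version before relying on them:

```yaml
# How long the supervisor waits for a freshly launched worker to
# heartbeat before logging "still hasn't started" and giving up.
supervisor.worker.start.timeout.secs: 300

# How long nimbus waits for a newly launched executor to heartbeat
# before it reassigns the topology.
nimbus.task.launch.secs: 300

# How long nimbus tolerates missing executor heartbeats on an
# already-running topology before declaring executors "not alive".
nimbus.task.timeout.secs: 60
```

The values above are just examples of raising the timeouts; tune them to however long your workers actually take to come up.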
On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:

> Hi Erik
>
> Thanks for your reply!  It's great to hear about real production usages.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves (I
> actually tried killing the supervisor, and the worker continued running
> beyond 30 minutes).
>
> The sequence of events is like this: the supervisor complains that the
> worker "still has not started" for a few seconds right after launching the
> worker process, then goes silent --> after 26 minutes, nimbus complains
> that executors (belonging to that worker) are "not alive" and starts to
> reassign the topology --> after another ~500 milliseconds, the supervisor
> shuts down its worker --> other peer workers complain about netty issues,
> and the loop goes on.
>
> Could you kindly tell me what version of ZooKeeper is used with 0.9.4, and
> how many nodes are in the ZooKeeper cluster?
>
> I wonder if this is due to ZooKeeper issues.
>
> Thanks a lot,
> Fang
>
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
> wrote:
>
>> Hey Fang,
>>
>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>
>> One of the challenges with storm is figuring out what the root cause is
>> when things go haywire.  You'll wanna examine why the nimbus decided to
>> restart your worker processes.  It would happen when workers die and the
>> nimbus notices that storm executors aren't alive.  (There are logs in
>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>> looking at logs on the worker hosts.
>>
>> - Erik
>>
>>
>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>
>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not tried
>>> 0.9.5 yet but I don't see any significant differences there), and
>>> unfortunately we could not even have a clean run for over 30 minutes on a
>>> cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but
>>> on different disks.
>>>
>>> I have had huge trouble getting my data analytics topology to run stably.
>>> So I tried the simplest topology I can think of: just an empty bolt, no
>>> I/O except for reading from a Kafka queue.
>>>
>>> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka
>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>> size=1k):
>>> After 26 minutes, nimbus orders the topology killed because it believes
>>> the topology is dead; then after another 2 minutes, another kill; then
>>> another after 4 minutes, and on and on.
>>>
>>> I can understand there might be issues in the coordination among nimbus,
>>> workers, and executors (e.g., heartbeats). But are there any doable
>>> workarounds? I hope there are, since so many of you are using it in
>>> production :-)
>>>
>>> I would deeply appreciate any suggestions that could get even my toy
>>> topology working!
>>>
>>> Fang
>>>
>>>
>
