Hi Fang,

Did you check your GC log? Do you see anything abnormal?
What is your current max spout pending setting?
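
In case it helps, max spout pending can be set cluster-wide in storm.yaml (or per topology); a minimal sketch, where the value 1000 is purely illustrative and not a recommendation:

```yaml
# storm.yaml -- cap on un-acked tuples in flight per spout task
topology.max.spout.pending: 1000
```

Lowering this bounds the number of tuples buffered in the topology at once, which can in turn reduce memory pressure and GC pauses.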

On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:

> I also tried this, with no success.
>
> Thanks,
> Fang
>
> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>
>> After I wrote that, I realized you had tried an empty topology anyway.  This
>> should reduce any GC- or worker-initialization-related failures, though they
>> are still possible.  As Erik mentioned, check ZK.  Also, I'm not sure if this
>> is still required, but it used to be helpful to make sure your storm nodes
>> have each other listed in /etc/hosts.
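>>
>> For illustration, an entry set like the following on every node (host names
>> and addresses here are made up):
>>
>> ```
>> 10.0.0.11  storm-node1
>> 10.0.0.12  storm-node2
>> 10.0.0.13  storm-node3
>> ```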
>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>
>>> Make sure your topology is starting up in the allotted time, and if not
>>> try increasing the startup timeout.
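>>>
>>> A sketch of the relevant storm.yaml knob (the default is 120 seconds in
>>> 0.9.x as far as I recall; 300 is just an illustrative bump):
>>>
>>> ```yaml
>>> # allow slow-starting workers more time before the supervisor kills them
>>> supervisor.worker.start.timeout.secs: 300
>>> ```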
>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>
>>>> Hi Erik
>>>>
>>>> Thanks for your reply!  It's great to hear about real production
>>>> usage. For our use case, we are really puzzled by the outcome so far. The
>>>> initial investigation seems to indicate that workers don't die by
>>>> themselves (I actually tried killing the supervisor, and the worker
>>>> continued running beyond 30 minutes).
>>>>
>>>> The sequence of events is like this: the supervisor immediately complains
>>>> that the worker "still has not started" for a few seconds right after
>>>> launching the worker process, then goes silent --> after 26 minutes, nimbus
>>>> complains that executors (related to the worker) are "not alive" and starts
>>>> to reassign the topology --> after another ~500 milliseconds, the
>>>> supervisor shuts down its worker --> other peer workers complain about
>>>> netty issues, and the loop goes on.
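>>>>
>>>> For reference, my understanding is that the "not alive" decision is driven
>>>> by heartbeat timeouts in storm.yaml; the values below are what I believe
>>>> the 0.9.x defaults are, so please correct me if I'm wrong:
>>>>
>>>> ```yaml
>>>> task.heartbeat.frequency.secs: 3   # how often executors heartbeat to ZK
>>>> nimbus.task.timeout.secs: 30       # executor declared "not alive" after this
>>>> nimbus.supervisor.timeout.secs: 60 # supervisor declared dead after this
>>>> ```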
>>>>
>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4,
>>>> and how many nodes are in the zookeeper cluster?
>>>>
>>>> I wonder if this is due to zookeeper issues.
>>>>
>>>> Thanks a lot,
>>>> Fang
>>>>
>>>>
>>>>
>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey Fang,
>>>>>
>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>>>
>>>>> One of the challenges with storm is figuring out what the root cause
>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided to
>>>>> restart your worker processes.  It would happen when workers die and the
>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>> looking at logs on the worker hosts.
>>>>>
>>>>> - Erik
>>>>>
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>
>>>>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried
>>>>>> 0.9.5 yet, but I don't see any significant differences there), and
>>>>>> unfortunately we could not even get a clean run of over 30 minutes on a
>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes,
>>>>>> but on different disks.
>>>>>>
>>>>>> I have had huge trouble getting my data analytics topology to run
>>>>>> stably. So I tried the simplest topology I could think of: just an
>>>>>> empty bolt, with no I/O except for reading from the kafka queue.
>>>>>>
>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka
>>>>>> topic partition=1, spout task #=1, bolt #=20 with field grouping, msg
>>>>>> size=1k): after 26 minutes, nimbus orders the topology killed because it
>>>>>> believes the topology is dead; then after another 2 minutes, another
>>>>>> kill; then another after a further 4 minutes, and on and on.
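>>>>>>
>>>>>> For concreteness, the wiring is roughly the following (a
>>>>>> pseudocode-level sketch of the 0.9.x Java API; EmptyBolt, spoutConfig,
>>>>>> and the "key" field name are placeholders, not my actual code):
>>>>>>
>>>>>> ```java
>>>>>> TopologyBuilder builder = new TopologyBuilder();
>>>>>> // one spout task reading the single-partition kafka topic
>>>>>> builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
>>>>>> // 20 empty-bolt executors, field-grouped on the message key
>>>>>> builder.setBolt("empty-bolt", new EmptyBolt(), 20)
>>>>>>        .fieldsGrouping("kafka-spout", new Fields("key"));
>>>>>> ```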
>>>>>>
>>>>>> I can understand there might be issues in the coordination among
>>>>>> nimbus, workers, and executors (e.g., heartbeats). But are there any
>>>>>> doable workarounds? I hope there are, since so many of you are using it
>>>>>> in production :-)
>>>>>>
>>>>>> I deeply appreciate any suggestions that could get even my toy
>>>>>> topology working!
>>>>>>
>>>>>> Fang
>>>>>>
>>>>>>
>>>>
>
