Just to be sure, are you using Storm or Storm Trident?
Also, can you share the current setting of your supervisor.childopts?
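For reference, GC logging for the supervisor and worker JVMs is typically enabled through the child opts in storm.yaml; a minimal sketch, where the heap sizes and log paths are illustrative assumptions, not recommendations:

```yaml
# storm.yaml -- illustrative heap sizes and log paths, not recommendations
supervisor.childopts: "-Xmx256m -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/storm/supervisor-gc.log"
worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/storm/worker-%ID%-gc.log"
```

In 0.9.x, `%ID%` in worker.childopts is substituted with the worker's port, so each worker writes its own GC log.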

On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <[email protected]> wrote:

> I did enable GC logging for both worker and supervisor and found nothing
> abnormal (pauses are minimal and frequency is normal too).  I tried max
> spout pending values of both 1000 and 500.
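For context, the max spout pending values mentioned above correspond to a single setting, configurable cluster-wide in storm.yaml or per topology; a minimal sketch, using the 500 value above as the illustration:

```yaml
# storm.yaml -- caps the number of un-acked tuples in flight per spout task
topology.max.spout.pending: 500
```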
>
> Fang
>
> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <[email protected]>
> wrote:
>
>> Hi Fang,
>>
>> Did you check your GC log? Do you see anything abnormal?
>> What is your current max spout pending setting?
>>
>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:
>>
>>> I also tried this, without success.
>>>
>>> Thanks,
>>> Fang
>>>
>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>>>
>>>> After I wrote that I realized you had already tried an empty topology.
>>>> This should reduce any GC- or worker-initialization-related failures,
>>>> though they are still possible.  As Erik mentioned, check ZK.  Also, I'm
>>>> not sure if this is still required, but it used to be helpful to make
>>>> sure your storm nodes have each other listed in /etc/hosts.
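A sketch of what that /etc/hosts mapping might look like (the hostnames and addresses here are hypothetical):

```
# /etc/hosts on each storm and zookeeper node -- hypothetical entries
10.0.0.11   storm-node1
10.0.0.12   storm-node2
10.0.0.13   storm-node3
```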
>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>>>
>>>>> Make sure your topology is starting up in the allotted time, and if
>>>>> not, try increasing the startup timeout.
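The startup and heartbeat timeouts involved here are storm.yaml settings; a sketch showing the 0.9.x defaults (raise them if workers are slow to come up):

```yaml
# storm.yaml -- values shown are the 0.9.x defaults
supervisor.worker.start.timeout.secs: 120  # how long the supervisor waits for a worker to start
nimbus.task.launch.secs: 120               # heartbeat grace period right after launch
nimbus.task.timeout.secs: 30               # after this, nimbus considers an executor "not alive"
```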
>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>>>
>>>>>> Hi Erik
>>>>>>
>>>>>> Thanks for your reply!  It's great to hear about real production
>>>>>> usage.  For our use case, we are really puzzled by the outcome so far.
>>>>>> The initial investigation seems to indicate that workers don't die by
>>>>>> themselves (I actually tried killing the supervisor, and the worker
>>>>>> continued running beyond 30 minutes).
>>>>>>
>>>>>> The sequence of events is like this: the supervisor immediately
>>>>>> complains that the worker "still has not started" for a few seconds
>>>>>> right after launching the worker process, then goes silent --> after
>>>>>> 26 minutes, nimbus complains that executors (related to the worker)
>>>>>> are "not alive" and starts to reassign the topology --> after another
>>>>>> ~500 milliseconds, the supervisor shuts down its worker --> other peer
>>>>>> workers complain about netty issues, and the loop goes on.
>>>>>>
>>>>>> Could you kindly tell me which version of zookeeper is used with
>>>>>> 0.9.4, and how many nodes are in the zookeeper cluster?
>>>>>>
>>>>>> I wonder if this is due to zookeeper issues.
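For comparison, the zookeeper-related settings Storm reads from storm.yaml look like this; the server names are hypothetical, and the timeout values are the 0.9.x defaults:

```yaml
# storm.yaml -- hypothetical ensemble; timeout values are 0.9.x defaults (ms)
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
storm.zookeeper.port: 2181
storm.zookeeper.session.timeout: 20000
storm.zookeeper.connection.timeout: 15000
```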
>>>>>>
>>>>>> Thanks a lot,
>>>>>> Fang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hey Fang,
>>>>>>>
>>>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>>>> nodes.
>>>>>>>
>>>>>>> One of the challenges with storm is figuring out what the root cause
>>>>>>> is when things go haywire.  You'll wanna examine why the nimbus decided 
>>>>>>> to
>>>>>>> restart your worker processes.  It would happen when workers die and the
>>>>>>> nimbus notices that storm executors aren't alive.  (There are logs in
>>>>>>> nimbus for this.)  Then you'll wanna dig into why the workers died by
>>>>>>> looking at logs on the worker hosts.
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>>>
>>>>>>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not
>>>>>>>> tried 0.9.5 yet, but I don't see any significant differences there),
>>>>>>>> and unfortunately we could not even get a clean run of over 30
>>>>>>>> minutes on a cluster of 5 high-end nodes.  Zookeeper is also set up
>>>>>>>> on these nodes, but on different disks.
>>>>>>>>
>>>>>>>> I have had huge trouble getting my data analytics topology to run
>>>>>>>> stably, so I tried the simplest topology I can think of: just an
>>>>>>>> empty bolt, with no I/O except for reading from a kafka queue.
>>>>>>>>
>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>>>> (kafka topic partitions=1, spout task #=1, bolt #=20 with field
>>>>>>>> grouping, msg size=1k): after 26 minutes, nimbus orders the topology
>>>>>>>> killed because it believes the topology is dead; then after another
>>>>>>>> 2 minutes, another kill; then another after a further 4 minutes, and
>>>>>>>> on and on.
>>>>>>>>
>>>>>>>> I can understand there might be issues in the coordination among
>>>>>>>> nimbus, workers, and executors (e.g., heartbeats).  But are there
>>>>>>>> any workable workarounds?  I hope there are, as so many of you are
>>>>>>>> using it in production :-)
>>>>>>>>
>>>>>>>> I deeply appreciate any suggestions that could make even my toy
>>>>>>>> topology work!
>>>>>>>>
>>>>>>>> Fang
>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>>
>
