I also did this and found no success.

Thanks,
Fang
On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:

> After I wrote that I realized you tried the empty topology anyway. This
> should reduce any GC- or worker-initialization-related failures, though they
> are still possible. As Erik mentioned, check ZK. Also, I'm not sure if this
> is still required, but it used to be helpful to make sure your Storm nodes
> have each other listed in /etc/hosts.
>
> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>
>> Make sure your topology is starting up in the allotted time, and if not,
>> try increasing the startup timeout.
>>
>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>
>>> Hi Erik,
>>>
>>> Thanks for your reply! It's great to hear about real production usage.
>>> For our use case, we are really puzzled by the outcome so far. The initial
>>> investigation seems to indicate that the workers don't die by themselves (I
>>> actually tried killing the supervisor, and the worker continued running
>>> beyond 30 minutes).
>>>
>>> The sequence of events is like this: the supervisor immediately complains
>>> that the worker "still has not started" for a few seconds right after
>>> launching the worker process, then goes silent --> after 26 minutes, nimbus
>>> complains that the executors (related to that worker) are "not alive" and
>>> starts to reassign the topology --> after another ~500 milliseconds, the
>>> supervisor shuts down its worker --> other peer workers complain about
>>> Netty issues. And the loop goes on.
>>>
>>> Could you kindly tell me what version of ZooKeeper is used with 0.9.4,
>>> and how many nodes are in the ZooKeeper cluster?
>>>
>>> I wonder if this is due to ZooKeeper issues.
>>>
>>> Thanks a lot,
>>> Fang
>>>
>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>>>
>>>> Hey Fang,
>>>>
>>>> Yes, Groupon runs Storm 0.9.3 (with ZeroMQ instead of Netty) and Storm
>>>> 0.9.4 (with Netty) at scale, in clusters on the order of 30+ nodes.
>>>>
>>>> One of the challenges with Storm is figuring out what the root cause is
>>>> when things go haywire. You'll wanna examine why the nimbus decided to
>>>> restart your worker processes. That happens when workers die and the
>>>> nimbus notices that Storm executors aren't alive. (There are logs in
>>>> nimbus for this.) Then you'll wanna dig into why the workers died by
>>>> looking at the logs on the worker hosts.
>>>>
>>>> - Erik
>>>>
>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>
>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried
>>>>> 0.9.5 yet, but I don't see any significant differences there), and
>>>>> unfortunately we could not even get a clean run for over 30 minutes on a
>>>>> cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes,
>>>>> but on different disks.
>>>>>
>>>>> I have had huge trouble getting my data analytics topology a stable
>>>>> run, so I tried the simplest topology I can think of: just an empty
>>>>> bolt, no I/O except for reading from a Kafka queue.
>>>>>
>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka
>>>>> topic partitions=1, spout task #=1, bolt #=20 with fields grouping, msg
>>>>> size=1k): after 26 minutes, nimbus orders the topology killed as it
>>>>> believes the topology is dead; then after another 2 minutes, another
>>>>> kill; then another after 4 minutes, and on and on.
>>>>>
>>>>> I can understand there might be issues in the coordination among
>>>>> nimbus, workers, and executors (e.g., heartbeats). But are there any
>>>>> doable workarounds? I hope there are, as so many of you are using it in
>>>>> production :-)
>>>>>
>>>>> I deeply appreciate any suggestions that could make even my toy
>>>>> topology work!
>>>>>
>>>>> Fang
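
For anyone hitting the same loop: Nathan's "increase the startup timeout" advice and the "not alive" reassignments in the thread map to a handful of storm.yaml settings. Below is a minimal sketch of the knobs involved; the raised values are illustrative assumptions for a struggling cluster, not recommendations, and the commented defaults are from the 0.9.x-era defaults.yaml (verify against your version):

```yaml
# storm.yaml (on nimbus and supervisor nodes) -- illustrative values only

# How long the supervisor waits for a freshly launched worker to start
# heartbeating before logging "still hasn't started" and killing it.
supervisor.worker.start.timeout.secs: 300   # default 120

# How long the supervisor tolerates a missed worker heartbeat once running.
supervisor.worker.timeout.secs: 60          # default 30

# How long nimbus waits for executors of a newly launched topology
# before declaring them not alive and reassigning.
nimbus.task.launch.secs: 300                # default 120

# How long nimbus tolerates missed executor heartbeats at steady state.
nimbus.task.timeout.secs: 60                # default 30

# ZooKeeper session/connection timeouts (milliseconds); worth raising if
# the "not alive" reassignments turn out to be ZK session expirations.
storm.zookeeper.session.timeout: 30000      # default 20000
storm.zookeeper.connection.timeout: 20000   # default 15000
```

Raising timeouts only masks whatever is delaying the heartbeats, so it's a diagnostic aid: if the loop stops with larger values, the next step is still Erik's advice of digging into the nimbus and worker logs (and ZK health) to find out why heartbeats were late in the first place.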
