Hi Fang,

Did you check your GC log? Do you see anything abnormal? What is your
current max spout pending setting?
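For reference: max spout pending caps the number of un-acked tuples in
flight per spout task, and it only takes effect when the spout emits
tuples with message IDs (i.e., acking is enabled). If it's unset, a fast
spout can flood slow workers and contribute to missed heartbeats. Here is
a minimal sketch of setting it per topology with the 0.9.x Java API; the
class name, the stand-in spout, and the value 1000 are illustrative
placeholders, not recommendations:

    // Minimal sketch for Storm 0.9.x ("backtype" packages; 1.x+ uses
    // org.apache.storm.* instead).
    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.TopologyBuilder;

    public class MaxSpoutPendingExample {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Stand-in spout for illustration. Note that TestWordSpout emits
            // tuples without message IDs, so the pending cap only actually
            // throttles a reliable spout such as the Kafka spout.
            builder.setSpout("words", new TestWordSpout(), 1);

            Config conf = new Config();
            conf.setNumWorkers(2);
            // Cap un-acked tuples in flight per spout task
            // (topology.max.spout.pending).
            conf.setMaxSpoutPending(1000);  // illustrative value; tune for your load

            StormSubmitter.submitTopology("empty-bolt-test", conf,
                    builder.createTopology());
        }
    }

The same key can also be set cluster-wide as topology.max.spout.pending in
storm.yaml.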
On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <[email protected]> wrote:

> I also did this and found no success.
>
> Thanks,
> Fang
>
> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <[email protected]> wrote:
>
>> After I wrote that I realized you tried an empty topology anyway. This
>> should reduce any GC- or worker-initialization-related failures, though
>> they are still possible. As Erik mentioned, check ZK. Also, I'm not sure
>> if this is still required, but it used to be helpful to make sure your
>> storm nodes have each other listed in /etc/hosts.
>>
>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <[email protected]> wrote:
>>
>>> Make sure your topology is starting up in the allotted time, and if
>>> not, try increasing the startup timeout.
>>>
>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
>>>
>>>> Hi Erik,
>>>>
>>>> Thanks for your reply! It's great to hear about real production
>>>> usage. For our use case, we are really puzzled by the outcome so far.
>>>> The initial investigation seems to indicate that workers don't die by
>>>> themselves (I actually tried killing the supervisor, and the worker
>>>> would continue running beyond 30 minutes).
>>>>
>>>> The sequence of events is like this: the supervisor immediately
>>>> complains that the worker "still has not started" for a few seconds
>>>> right after launching the worker process, then goes silent --> after
>>>> 26 minutes, nimbus complains that executors (related to the worker)
>>>> are "not alive" and starts to reassign the topology --> after another
>>>> ~500 milliseconds, the supervisor shuts down its worker --> other
>>>> peer workers complain about netty issues, and the loop goes on.
>>>>
>>>> Could you kindly tell me what version of zookeeper is used with
>>>> 0.9.4, and how many nodes are in the zookeeper cluster?
>>>>
>>>> I wonder if this is due to zookeeper issues.
>>>>
>>>> Thanks a lot,
>>>> Fang
>>>>
>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey Fang,
>>>>>
>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and
>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+
>>>>> nodes.
>>>>>
>>>>> One of the challenges with storm is figuring out what the root
>>>>> cause is when things go haywire. You'll wanna examine why the nimbus
>>>>> decided to restart your worker processes. That happens when workers
>>>>> die and the nimbus notices that storm executors aren't alive. (There
>>>>> are logs in nimbus for this.) Then you'll wanna dig into why the
>>>>> workers died by looking at logs on the worker hosts.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>>>
>>>>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not
>>>>>> tried 0.9.5 yet, but I don't see any significant differences
>>>>>> there), and unfortunately we could not even get a clean run of over
>>>>>> 30 minutes on a cluster of 5 high-end nodes. zookeeper is also set
>>>>>> up on these nodes, but on different disks.
>>>>>>
>>>>>> I have had huge trouble getting my data analytics topology to run
>>>>>> stably. So I tried the simplest topology I can think of: just an
>>>>>> empty bolt, no I/O except for reading from the kafka queue.
>>>>>>
>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt
>>>>>> (kafka topic partition=1, spout task #=1, bolt #=20 with field
>>>>>> grouping, msg size=1k). After 26 minutes, nimbus orders the
>>>>>> topology killed as it believes the topology is dead; then after
>>>>>> another 2 minutes, another kill; then another after another 4
>>>>>> minutes; and on and on.
>>>>>>
>>>>>> I can understand there might be issues in the coordination among
>>>>>> nimbus, workers, and executors (e.g., heartbeats). But are there
>>>>>> any doable workarounds? I hope there are, as so many of you are
>>>>>> using it in production :-)
>>>>>>
>>>>>> I deeply appreciate any suggestions that could make even my toy
>>>>>> topology work!
>>>>>>
>>>>>> Fang
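
Regarding the question of doable workarounds above: if the workers turn
out to be healthy but slow to heartbeat (long GC pauses, or the
co-located zookeeper getting starved for disk I/O), one commonly tried
mitigation is to loosen the daemon-side timeouts in storm.yaml on the
nimbus and supervisor machines. The keys below all exist in the 0.9.x
defaults.yaml; the raised values are illustrative guesses to experiment
with, not recommendations:

    # How long nimbus waits for executor heartbeats (written via ZooKeeper)
    # before declaring executors "not alive" and reassigning the topology.
    nimbus.task.timeout.secs: 120              # default 30
    # Extra grace period for freshly launched tasks.
    nimbus.task.launch.secs: 240               # default 120
    # How long the supervisor waits for a new worker's first heartbeat
    # (the "still has not started" phase) before giving up on it.
    supervisor.worker.start.timeout.secs: 240  # default 120
    # Steady-state worker heartbeat timeout on the supervisor side.
    supervisor.worker.timeout.secs: 60         # default 30
    # ZooKeeper session/connection timeouts, in milliseconds; too-tight
    # values plus GC pauses can make healthy workers look dead.
    storm.zookeeper.session.timeout: 30000     # default 20000
    storm.zookeeper.connection.timeout: 30000  # default 15000

Raising these buys headroom rather than fixing the root cause (GC
pressure, an overloaded zookeeper ensemble, etc.), but it is a quick way
to confirm whether heartbeat timeouts are what triggers the "not alive"
reassignment loop.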
