Make sure your topology is starting up in the allotted time, and if not, try increasing the startup timeout.

On Jun 12, 2015 2:46 AM, "Fang Chen" <[email protected]> wrote:
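The startup timeout mentioned above is set in `storm.yaml`. A minimal sketch, assuming Storm 0.9.x config key names; the values here are illustrative, not recommendations:

```yaml
# storm.yaml -- illustrative values for the timeouts discussed in this thread.
# How long the supervisor waits for a newly launched worker to heartbeat
# before logging "still hasn't started" and killing it (default 120).
supervisor.worker.start.timeout.secs: 240

# How long nimbus waits without an executor heartbeat before marking the
# executors "not alive" and reassigning the topology (default 30).
nimbus.task.timeout.secs: 60
```

Raising these only hides a slow or stalled startup; the underlying cause (slow deploys, classpath scanning, GC, zookeeper latency) still deserves a look in the worker and nimbus logs.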
> Hi Erik,
>
> Thanks for your reply! It's great to hear about real production usage.
> For our use case, we are really puzzled by the outcome so far. The initial
> investigation seems to indicate that workers don't die by themselves (I
> actually tried killing the supervisor, and the worker continued running
> beyond 30 minutes).
>
> The sequence of events is like this: the supervisor immediately complains
> that the worker "still has not started" for a few seconds right after
> launching the worker process, then goes silent --> after 26 minutes, nimbus
> complains that executors (related to the worker) are "not alive" and starts
> to reassign the topology --> after another ~500 milliseconds, the
> supervisor shuts down its worker --> other peer workers complain about
> netty issues, and the loop goes on.
>
> Could you kindly tell me what version of zookeeper is used with 0.9.4, and
> how many nodes are in the zookeeper cluster?
>
> I wonder if this is due to zookeeper issues.
>
> Thanks a lot,
> Fang
>
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
> wrote:
>
>> Hey Fang,
>>
>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>
>> One of the challenges with storm is figuring out the root cause when
>> things go haywire. You'll want to examine why the nimbus decided to
>> restart your worker processes. That happens when workers die and the
>> nimbus notices that storm executors aren't alive. (There are logs in
>> nimbus for this.) Then you'll want to dig into why the workers died by
>> looking at the logs on the worker hosts.
>>
>> - Erik
>>
>>
>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>
>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried
>>> 0.9.5 yet, but I don't see any significant differences there), and
>>> unfortunately we could not even get a clean run of over 30 minutes on a
>>> cluster of 5 high-end nodes.
>>> Zookeeper is also set up on these nodes, but on
>>> different disks.
>>>
>>> I have had huge trouble getting my data analytics topology to run
>>> stably. So I tried the simplest topology I can think of: just an empty
>>> bolt, with no I/O except for reading from a kafka queue.
>>>
>>> Just to report my latest test on 0.9.4 with this empty bolt (kafka
>>> topic partitions=1, spout tasks=1, bolts=20 with fields grouping, msg
>>> size=1k): after 26 minutes, nimbus orders the topology killed because
>>> it believes the topology is dead; then after another 2 minutes, another
>>> kill; then another after another 4 minutes, and on and on.
>>>
>>> I can understand there might be issues in the coordination among
>>> nimbus, workers, and executors (e.g., heartbeats). But are there any
>>> doable workarounds? I wish there were, as so many of you are using it
>>> in production :-)
>>>
>>> I deeply appreciate any suggestions that could make even my toy
>>> topology work!
>>>
>>> Fang
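Since the symptoms in this thread point at heartbeat coordination (workers heartbeat through zookeeper, and nimbus reads those heartbeats to decide liveness), one workaround is to loosen the liveness timeouts while investigating. A sketch, assuming Storm 0.9.x key names in `storm.yaml`; the non-default values are illustrative:

```yaml
# storm.yaml -- illustrative liveness tuning, not a recommendation.
# How often tasks/workers publish heartbeats (defaults shown).
task.heartbeat.frequency.secs: 3
worker.heartbeat.frequency.secs: 1

# How long the supervisor tolerates a missing local worker heartbeat
# before shutting the worker down (default 30).
supervisor.worker.timeout.secs: 60

# Zookeeper session timeout in ms (default 20000). Long GC pauses on
# workers can expire the session and make healthy workers look dead.
storm.zookeeper.session.timeout: 30000
```

If heartbeats are being written but read late, checking zookeeper disk latency and JVM GC logs on the worker hosts is usually more productive than raising timeouts further.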
