supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3
task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1
some nimbus parameters:

nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30

When a worker dies, the log in one of the supervisors shows it shutting
down the worker with a state of "disallowed" (which, from what I found by
googling around, some people attribute to nimbus reassignment). The other
logs only show the worker being shut down, without any further information.

On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]> wrote:

> I'll have to look later; I think we are using ZooKeeper v3.3.6 (something
> like that). Some clusters have 3 ZK hosts, some 5.
>
> The way the nimbus detects that the executors are not alive is by not
> seeing heartbeats updated in ZK. There has to be some cause for the
> heartbeats not being updated. The most likely one is that the worker
> process is dead. Another one could be that the process is too busy
> garbage collecting, and so missed the timeout for updating the heartbeat.
>
> Regarding Supervisor and Worker: I think it's normal for the worker to be
> able to live absent the presence of the supervisor, so that sounds like
> expected behavior.
>
> What are your timeouts for the various heartbeats?
>
> Also, when the worker dies you should see a log from the supervisor
> noticing it.
>
> - Erik
>
> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>
>> Hi Erik,
>>
>> Thanks for your reply! It's great to hear about real production usage.
>> For our use case, we are really puzzled by the outcome so far. The
>> initial investigation seems to indicate that workers don't die by
>> themselves (I actually tried killing the supervisor, and the worker
>> continued running beyond 30 minutes).
>>
>> The sequence of events is like this: the supervisor immediately
>> complains that the worker "still has not started" for a few seconds
>> right after launching the worker process, then goes silent --> after 26
>> minutes, nimbus complains that the executors (related to the worker) are
>> "not alive" and starts to reassign the topology --> after another ~500
>> milliseconds, the supervisor shuts down its worker --> other peer
>> workers complain about netty issues, and the loop goes on.
>>
>> Could you kindly tell me what version of ZooKeeper is used with 0.9.4,
>> and how many nodes are in the ZooKeeper cluster?
>>
>> I wonder if this is due to ZooKeeper issues.
>>
>> Thanks a lot,
>> Fang
>>
>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]>
>> wrote:
>>
>>> Hey Fang,
>>>
>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
>>> 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>>>
>>> One of the challenges with storm is figuring out what the root cause
>>> is when things go haywire. You'll wanna examine why the nimbus decided
>>> to restart your worker processes. That happens when workers die and
>>> the nimbus notices that storm executors aren't alive. (There are logs
>>> in nimbus for this.) Then you'll wanna dig into why the workers died
>>> by looking at logs on the worker hosts.
>>>
>>> - Erik
>>>
>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>>>
>>>> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not
>>>> tried 0.9.5 yet, but I don't see any significant differences there),
>>>> and unfortunately we could not even get a clean run for over 30
>>>> minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on
>>>> these nodes, but on different disks.
>>>>
>>>> I have had huge trouble getting my data analytics topology to run
>>>> stably, so I tried the simplest topology I can think of: just an
>>>> empty bolt, no I/O except for reading from a Kafka queue.
>>>>
>>>> Just to report my latest testing on 0.9.4 with this empty bolt (kafka
>>>> topic partition = 1, spout task # = 1, bolt # = 20 with field
>>>> grouping, msg size = 1k): after 26 minutes, nimbus orders the
>>>> topology killed as it believes the topology is dead; then after
>>>> another 2 minutes, another kill; then another after another 4
>>>> minutes; and on and on.
>>>>
>>>> I can understand there might be issues in the coordination among
>>>> nimbus, worker and executor (e.g., heartbeats). But are there any
>>>> doable workarounds? I wish there were, as so many of you are using it
>>>> in production :-)
>>>>
>>>> I deeply appreciate any suggestions that could make even my toy
>>>> topology work!
>>>>
>>>> Fang
>>>>
>>
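[Editor's note: the supervisor/worker and nimbus settings quoted in this thread would normally live together in storm.yaml. A sketch of such a file, using only the values mentioned above (YAML colon syntax; all other settings assumed left at their defaults):]

```yaml
# storm.yaml (excerpt) -- heartbeat/timeout values quoted in this thread
supervisor.heartbeat.frequency.secs: 5
supervisor.monitor.frequency.secs: 3
task.heartbeat.frequency.secs: 3
worker.heartbeat.frequency.secs: 1

nimbus.monitor.freq.secs: 120
nimbus.reassign: true
nimbus.supervisor.timeout.secs: 60
nimbus.task.launch.secs: 120
nimbus.task.timeout.secs: 30
```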
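[Editor's note: Erik's description of how nimbus decides executors are "not alive" (a heartbeat in ZK older than the task timeout) can be sketched roughly as follows. This is illustrative Python only, not Storm's actual code; the function name and parameters are hypothetical:]

```python
import time

def executors_alive(last_heartbeat_secs, now_secs, task_timeout_secs=30):
    """Return True if the executor's last heartbeat is recent enough.

    Mirrors the idea behind nimbus.task.timeout.secs: an executor whose
    heartbeat in ZooKeeper is older than the timeout is treated as dead,
    whether the worker process actually died or merely stalled (e.g. in
    a long garbage-collection pause).
    """
    return (now_secs - last_heartbeat_secs) <= task_timeout_secs

now = time.time()
print(executors_alive(now - 10, now))  # heartbeat 10s old -> True
print(executors_alive(now - 45, now))  # heartbeat 45s old -> False
```

[With nimbus.monitor.freq.secs at 120 as in this thread, such a check would run only every two minutes, so an executor can be dead for roughly the timeout plus one monitor interval before nimbus reacts.]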
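[Editor's note: the "still has not started" phase Fang describes amounts to the supervisor polling for the newly launched worker's first heartbeat within a launch window (compare nimbus.task.launch.secs above). A minimal sketch under that assumption; `wait_for_worker_start` and its `heartbeat_seen` callback are hypothetical, not Storm's API:]

```python
import time

def wait_for_worker_start(heartbeat_seen, launch_timeout_secs=120,
                          poll_secs=0.1):
    """Poll until a freshly launched worker reports its first heartbeat.

    Roughly models the supervisor's launch phase: while no heartbeat has
    appeared it keeps polling (this is where a real supervisor would log
    "still has not started"); once the launch window is exhausted it
    gives up and returns False.
    """
    deadline = time.monotonic() + launch_timeout_secs
    while time.monotonic() < deadline:
        if heartbeat_seen():
            return True
        time.sleep(poll_secs)
    return False

# Worker whose heartbeat appears on the second poll.
calls = iter([False, True])
print(wait_for_worker_start(lambda: next(calls), launch_timeout_secs=2))
# prints True
```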
