There is something fundamentally wrong. You need to get to the root cause of what the worker process is doing that is preventing the heartbeats from arriving.
- Erik On Friday, June 12, 2015, Fang Chen <[email protected]> wrote: > I tuned up all worker timeout and task time out to 600 seconds, and seems > like nimbus is happy about it after running the topology for 40minutes. But > still one supervisor complained timeout from worker and then shut it down: > > 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and > clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current > supervisor time: 1434153560. State: :timed-out, Heartbeat: > #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959, > :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11] > [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703} > > > It's so hard to believe that even 600 seconds is not enough. > > On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected] > <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: > >> I turned on debug and seems like the nimbus reassign was indeed caused by >> heartbeat timeouts after running the topology for about 20 minutes. You can >> see that those non-live executors have a ":is-timed-out true" status and >> executor reported time is about 100 second behind nimbus time, while other >> live executors have executor time head of nimbus time. >> >> >> ========== >> >> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [3 3] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [6 6] {:is-timed-out true, >> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [9 9] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355, >> :executor-reported-time 1434146250}, [12 12] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146479}, [15 15] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16] >> {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time >> 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [18 18] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [21 21] {:is-timed-out true, >> :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475, >> :executor-reported-time 1434146480}, [24 24] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355, >> :executor-reported-time 1434146250}, [27 27] {:is-timed-out false, >> :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28] >> {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time >> 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355, >> :executor-reported-time 1434146250}} >> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[6 6] not alive >> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[11 11] not alive >> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[16 16] not alive >> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[21 21] not alive >> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[26 26] not alive >> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor >> asyncVarGenTopology-1-1434145217:[1 1] not alive >> >> >> On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected] >> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: >> >>> supervisor.heartbeat.frequency.secs 5 >>> supervisor.monitor.frequency.secs 3 >>> >>> task.heartbeat.frequency.secs 3 >>> worker.heartbeat.frequency.secs 1 >>> >>> some nimbus parameters: >>> >>> nimbus.monitor.freq.secs 120 >>> nimbus.reassign true >>> nimbus.supervisor.timeout.secs 60 >>> nimbus.task.launch.secs 120 >>> nimbus.task.timeout.secs 30 >>> >>> When worker dies, the log in one of supervisors shows shutting down >>> worker with state of disallowed (which I googled around and some people >>> say it's due to nimbus reassign). Other logs only show shutting down worker >>> without any further information. >>> >>> >>> On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected] >>> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: >>> >>>> I'll have to look later, I think we are using ZooKeeper v3.3.6 >>>> (something like that). Some clusters have 3 ZK hosts, some 5. >>>> >>>> The way the nimbus detects that the executors are not alive is by not >>>> seeing heartbeats updated in ZK. There has to be some cause for the >>>> heartbeats not being updated. Most likely one is that the worker >>>> process is dead. Another one could be that the process is too busy Garbage >>>> Collecting, and so missed the timeout for updating the heartbeat. >>>> >>>> Regarding Supervisor and Worker: I think it's normal for the worker to >>>> be able to live absent the presence of the supervisor, so that sounds like >>>> expected behavior. >>>> >>>> What are your timeouts for the various heartbeats? >>>> >>>> Also, when the worker dies you should see a log from the supervisor >>>> noticing it. >>>> >>>> - Erik >>>> >>>> On Thursday, June 11, 2015, Fang Chen <[email protected] >>>> <javascript:_e(%7B%7D,'cvml','[email protected]');>> wrote: >>>> >>>>> Hi Erik >>>>> >>>>> Thanks for your reply! It's great to hear about real production >>>>> usages. For our use case, we are really puzzled by the outcome so far. The >>>>> initial investigation seems to indicate that workers don't die by >>>>> themselves ( i actually tried killing the supervisor and the worker would >>>>> continue running beyond 30 minutes). >>>>> >>>>> The sequence of events is like this: supervisor immediately complains >>>>> worker "still has not started" for a few seconds right after launching the >>>>> worker process, then silent --> after 26 minutes, nimbus complains >>>>> executors (related to the worker) "not alive" and started to reassign >>>>> topology --> after another ~500 milliseconds, the supervisor shuts down >>>>> its >>>>> worker --> other peer workers complain about netty issues. and the loop >>>>> goes on. >>>>> >>>>> Could you kindly tell me what version of zookeeper is used with 0.9.4? >>>>> and how many nodes in the zookeeper cluster? >>>>> >>>>> I wonder if this is due to zookeeper issues. >>>>> >>>>> Thanks a lot, >>>>> Fang >>>>> >>>>> >>>>> >>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected] >>>>> > wrote: >>>>> >>>>>> Hey Fang, >>>>>> >>>>>> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and >>>>>> storm 0.9.4 (with netty) at scale, in clusters on the order of 30+ >>>>>> nodes. >>>>>> >>>>>> One of the challenges with storm is figuring out what the root cause >>>>>> is when things go haywire. You'll wanna examine why the nimbus decided >>>>>> to >>>>>> restart your worker processes. It would happen when workers die and the >>>>>> nimbus notices that storm executors aren't alive. (There are logs in >>>>>> nimbus for this.) Then you'll wanna dig into why the workers died by >>>>>> looking at logs on the worker hosts. >>>>>> >>>>>> - Erik >>>>>> >>>>>> >>>>>> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote: >>>>>> >>>>>>> We have been testing storm from 0.9.0.1 until 0.9.4 (I have not >>>>>>> tried 0.9.5 yet but I don't see any significant differences there), and >>>>>>> unfortunately we could not even have a clean run for over 30 minutes on >>>>>>> a >>>>>>> cluster of 5 high-end nodes. zookeeper is also set up on these nodes >>>>>>> but on >>>>>>> different disks. >>>>>>> >>>>>>> I have huge troubles to give my data analytics topology a stable >>>>>>> run. So I tried the simplest topology I can think of, just an emtpy >>>>>>> bolt, >>>>>>> no io except for reading from kafka queue. >>>>>>> >>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt >>>>>>> (kakfa topic partition=1, spout task #=1, bolt #=20 with field grouping, >>>>>>> msg size=1k). >>>>>>> After 26 minutes, nimbus orders to kill the topology as it believe >>>>>>> the topology is dead, then after another 2 minutes, another kill, then >>>>>>> another after another 4 minutes, and on and on. >>>>>>> >>>>>>> I can understand there might be issues in the coordination among >>>>>>> nimbus, worker and executor (e.g., heartbeats). But are there any doable >>>>>>> workarounds? I wish there are as so many of you are using it in >>>>>>> production >>>>>>> :-) >>>>>>> >>>>>>> I deeply appreciate any suggestions that could even make my toy >>>>>>> topology working! >>>>>>> >>>>>>> Fang >>>>>>> >>>>>>> >>>>> >>> >> >
