Thanks Derek! I just tried forceSync=no, but it did not make any difference.
Fang

On Sat, Jun 13, 2015 at 11:23 AM, Derek Dagit <[email protected]> wrote:

> It sounds like a ZooKeeper issue to me.
>
> Some things we have seen before:
>
> 1. Too much write load on the disks from ZK in general.
> 2. The ZK distribution comes with a tool that will clean up/purge old files. When ZK purges its edit logs/data logs, it can remove several large files. On a file system like EXT, these deletions create non-trivial disk load and can block everything else going on until the purge is done.
>
> For 1., we run production with one work-around that revans2 found that *significantly* helps:
>
> -Dzookeeper.forceSync=no
>
> Add this to the JVM arguments when launching each of the ZK servers in your cluster. Normally, each write is written to disk in order to persist the changes. But in Storm, we figured this data really does not need to be persisted so aggressively. If enough ZK nodes go down at once, data can be lost. In practice this is a risk that we think we can take.
>
> After using the work-around for 1., 2. did not matter so much, but we still have a tool that spaces out the deletes based on disk performance.
>
> --
> Derek
>
> ------------------------------
> *From:* Erik Weathers <[email protected]>
> *To:* "[email protected]" <[email protected]>
> *Sent:* Saturday, June 13, 2015 1:52 AM
> *Subject:* Re: Has anybody successfully run storm 0.9+ in production under reasonable load?
>
> There is something fundamentally wrong. You need to get to the root cause of what the worker process is doing that is preventing the heartbeats from arriving.
>
> - Erik
>
> On Friday, June 12, 2015, Fang Chen <[email protected]> wrote:
>
> I tuned up all the worker and task timeouts to 600 seconds, and nimbus seemed happy with that after running the topology for 40 minutes. But one supervisor still complained about a timeout from a worker and then shut it down:
>
> 2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and clearing state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current supervisor time: 1434153560. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959, :storm-id "asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11] [16 16] [21 21] [26 26] [-1 -1] [1 1]}, :port 6703}
>
> It's so hard to believe that even 600 seconds is not enough.
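For reference, Derek's forceSync work-around above is a JVM system property read by the ZooKeeper server, so it must be passed to each ZK server's JVM, not to Storm. Below is a minimal sketch of one way to apply it, assuming a stock ZooKeeper 3.4.x layout where conf/zookeeper-env.sh is sourced at startup; the file names, paths, and retention values are assumptions, not taken from this thread.

    # conf/zookeeper-env.sh -- sourced when the ZK server starts (assumed layout;
    # adjust for your own init scripts).
    # Disable the per-write fsync of the transaction log. Risk: data can be lost
    # if enough ZK nodes crash at once, as Derek notes.
    export SERVER_JVMFLAGS="-Dzookeeper.forceSync=no ${SERVER_JVMFLAGS}"

    # conf/zoo.cfg -- on ZooKeeper 3.4+, the stock autopurge settings run the
    # snapshot/txn-log cleanup periodically (this is not Derek's disk-aware
    # delete-spacing tool, just the built-in purge):
    autopurge.snapRetainCount=5
    autopurge.purgeInterval=1

On older 3.3.x servers, the PurgeTxnLog utility shipped with ZooKeeper can be run from cron instead.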
On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected]> wrote:

> I turned on debug, and it seems the nimbus reassignment was indeed caused by heartbeat timeouts after running the topology for about 20 minutes. You can see that the non-live executors have an ":is-timed-out true" status and an executor-reported time about 100 seconds behind nimbus time, while the live executors have executor times ahead of nimbus time.
>
> ==========
>
> Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [3 3] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [4 4] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [5 5] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [6 6] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [7 7] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [8 8] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [9 9] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [10 10] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [12 12] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [15 15] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [16 16] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [17 17] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [18 18] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [19 19] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [20 20] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [21 21] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [22 22] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [24 24] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}, [27 27] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [28 28] {:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 1434146480}, [1 1] {:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 1434146250}}
>
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[6 6] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[11 11] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[16 16] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[21 21] not alive
> 2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[26 26] not alive
> 2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[1 1] not alive

On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected]> wrote:

> supervisor.heartbeat.frequency.secs 5
> supervisor.monitor.frequency.secs 3
> task.heartbeat.frequency.secs 3
> worker.heartbeat.frequency.secs 1
>
> Some nimbus parameters:
>
> nimbus.monitor.freq.secs 120
> nimbus.reassign true
> nimbus.supervisor.timeout.secs 60
> nimbus.task.launch.secs 120
> nimbus.task.timeout.secs 30
>
> When a worker dies, the log on one of the supervisors shows it shutting down the worker with a state of :disallowed (which I googled around; some people say it is caused by nimbus reassignment). The other logs only show the worker being shut down, without any further information.

On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]> wrote:

> I'll have to look later; I think we are using ZooKeeper v3.3.6 (something like that). Some clusters have 3 ZK hosts, some 5.
>
> The way the nimbus detects that the executors are not alive is by not seeing heartbeats updated in ZK. There has to be some cause for the heartbeats not being updated. The most likely one is that the worker process is dead. Another one could be that the process is too busy garbage collecting, and so missed the timeout for updating the heartbeat.
>
> Regarding Supervisor and Worker: I think it's normal for the worker to be able to live absent the presence of the supervisor, so that sounds like expected behavior.
>
> What are your timeouts for the various heartbeats?
>
> Also, when the worker dies you should see a log from the supervisor noticing it.
>
> - Erik
>
> On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:
>
> Hi Erik,
>
> Thanks for your reply! It's great to hear about real production usage. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves (I actually tried killing the supervisor, and the worker continued running beyond 30 minutes).
>
> The sequence of events is like this: the supervisor briefly complains that the worker "still has not started" for a few seconds right after launching the worker process, then goes silent --> after 26 minutes, nimbus complains that the executors belonging to that worker are "not alive" and starts to reassign the topology --> after another ~500 milliseconds, the supervisor shuts down its worker --> other peer workers complain about netty issues, and the loop goes on.
>
> Could you kindly tell me what version of ZooKeeper is used with 0.9.4, and how many nodes are in the ZooKeeper cluster?
>
> I wonder if this is due to ZooKeeper issues.
>
> Thanks a lot,
> Fang
>
> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:
>
> Hey Fang,
>
> Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
>
> One of the challenges with storm is figuring out what the root cause is when things go haywire. You'll wanna examine why the nimbus decided to restart your worker processes. That happens when workers die and the nimbus notices that storm executors aren't alive. (There are logs in nimbus for this.) Then you'll wanna dig into why the workers died by looking at logs on the worker hosts.
>
> - Erik
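A sketch of how the settings Fang lists earlier in the thread might be written in storm.yaml, with the worker and task timeouts raised to the 600 seconds he later says he tried. The key names are the stock Storm 0.9.x ones; the supervisor.worker.* entries are an assumption, since the thread only says "worker timeout" without naming the keys.

    # storm.yaml -- heartbeat frequencies as listed by Fang
    supervisor.heartbeat.frequency.secs: 5
    supervisor.monitor.frequency.secs: 3
    task.heartbeat.frequency.secs: 3
    worker.heartbeat.frequency.secs: 1

    # nimbus-side parameters as listed, with the task timeout raised per the later test
    nimbus.monitor.freq.secs: 120
    nimbus.reassign: true
    nimbus.supervisor.timeout.secs: 60
    nimbus.task.launch.secs: 120
    nimbus.task.timeout.secs: 600            # was 30 in the original run

    # supervisor-side worker timeouts (assumed keys for "worker timeout")
    supervisor.worker.timeout.secs: 600      # was 30 by default
    supervisor.worker.start.timeout.secs: 120  # governs the "still hasn't started" window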
On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

> We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not even get a clean run of more than 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but on different disks.
>
> I have had huge trouble getting my data analytics topology to run stably, so I tried the simplest topology I can think of: just an empty bolt, no I/O except for reading from the Kafka queue.
>
> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka topic partitions = 1, spout tasks = 1, bolts = 20 with fields grouping, message size = 1 KB): after 26 minutes, nimbus orders a kill of the topology because it believes the topology is dead; then after another 2 minutes, another kill; then another after another 4 minutes, and on and on.
>
> I can understand there might be issues in the coordination among nimbus, workers, and executors (e.g., heartbeats). But are there any doable workarounds? I wish there were, as so many of you are using it in production :-)
>
> I deeply appreciate any suggestions that could make even my toy topology work!
>
> Fang
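For anyone wanting to reproduce the test, below is a minimal sketch of the kind of "empty bolt" topology Fang describes, written against the Storm 0.9.x and storm-kafka APIs. The topic name, ZooKeeper connect string, consumer id, topology name, and worker count are placeholders (assumptions); the "str" output field comes from storm-kafka's StringScheme, not from the thread.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class EmptyBoltTopology {

        // A bolt that does no work: BasicBoltExecutor acks each tuple for us.
        public static class EmptyBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // intentionally empty
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // no output stream
            }
        }

        public static void main(String[] args) throws Exception {
            // Placeholder ZK connect string, topic, zkRoot, and consumer id.
            ZkHosts hosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "test-topic", "/kafka-spout", "empty-bolt-test");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme()); // emits a single "str" field

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);   // 1 spout task for the single partition
            builder.setBolt("empty-bolt", new EmptyBolt(), 20)                 // 20 bolt executors
                   .fieldsGrouping("kafka-spout", new Fields("str"));          // fields grouping, as in the test

            Config conf = new Config();
            conf.setNumWorkers(5); // assumption: one worker per node on the 5-node cluster
            StormSubmitter.submitTopology("emptyBoltTopology", conf, builder.createTopology());
        }
    }

Even a trivial topology like this exercises the full heartbeat path (worker to ZK, supervisor, and nimbus), which is why it is a useful probe for the timeout behavior discussed above.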
