It sounds like a ZooKeeper issue to me.
Some things we have seen before:
1. Too much write load on the disks from ZK in general.
2. The ZK distribution comes with a tool that cleans up/purges old files.  When 
ZK purges its edit logs/data logs, it can remove several large files.  On a 
file-system like EXT, these deletions create non-trivial disk load and can 
block everything else going on until the purge is done.


For 1., we run production with one work-around that revans2 found that 
*significantly* helps:
-Dzookeeper.forceSync=no
Add this to the JVM arguments when launching each of the ZK servers in your 
cluster.  Normally, each write is synced to disk in order to persist the 
changes.  But in Storm, we figured this data really does not need to be 
persisted so aggressively.  If enough ZK nodes go down at once, data can be 
lost; in practice, this is a risk that we think we can take.
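One common place to put that flag (assuming a stock Apache ZooKeeper layout, 
where the server start script sources conf/java.env via zkEnv.sh) is:

```shell
# conf/java.env -- sourced by zkEnv.sh when the ZK server starts.
# Path and mechanism may differ depending on how ZK was packaged.
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.forceSync=no"
```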

After applying the work-around for 1., 2. did not matter so much, but we still 
have a tool that spaces out the deletes based on disk performance. -- 
Derek
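Derek's internal tool isn't shown in this thread, but the idea of spacing out 
the deletes can be sketched as follows (a hypothetical illustration, not the 
actual tool): remove old snapshot/log files one at a time with a pause between 
unlinks, so the filesystem never sees one big burst of deletion.

```python
import os
import time

def throttled_purge(dirpath, keep=3, pause_secs=1.0):
    """Delete all but the newest `keep` files in dirpath, pausing between
    unlinks so deletions don't arrive as one burst of disk load."""
    # Oldest files first, by modification time.
    files = sorted(
        (os.path.join(dirpath, f) for f in os.listdir(dirpath)),
        key=os.path.getmtime,
    )
    for path in files[:-keep] if keep else files:
        os.unlink(path)         # one file at a time...
        time.sleep(pause_secs)  # ...with breathing room for other I/O
```

In practice the pause would be tuned to (or derived from) observed disk 
latency rather than fixed.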

      From: Erik Weathers <[email protected]>
 To: "[email protected]" <[email protected]> 
 Sent: Saturday, June 13, 2015 1:52 AM
 Subject: Re: Has anybody successfully run storm 0.9+ in production under 
reasonable load?
   
There is something fundamentally wrong.  You need to get to the root cause of 
what the worker process is doing that is preventing the heartbeats from 
arriving.
- Erik

On Friday, June 12, 2015, Fang Chen <[email protected]> wrote:



I tuned all worker and task timeouts up to 600 seconds, and nimbus seems 
happy about it after running the topology for 40 minutes. But one supervisor 
still complained of a timeout from a worker and then shut it down:
2015-06-12T23:59:20.633+0000 b.s.d.supervisor [INFO] Shutting down and clearing 
state for id 0cbfe8e5-b41f-451b-9005-107cef9b9e28. Current supervisor time: 
1434153560. State: :timed-out, Heartbeat: 
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1434152959, :storm-id 
"asyncVarGenTopology-1-1434151135", :executors #{[6 6] [11 11] [16 16] [21 21] 
[26 26] [-1 -1] [1 1]}, :port 6703}


It's so hard to believe that even 600 seconds is not enough.
On Fri, Jun 12, 2015 at 3:27 PM, Fang Chen <[email protected]> wrote:

I turned on debug, and it seems the nimbus reassign was indeed caused by 
heartbeat timeouts after running the topology for about 20 minutes. You can see 
that those non-live executors have an ":is-timed-out true" status and their 
executor-reported time is about 100 seconds behind nimbus time, while the live 
executors have executor time ahead of nimbus time.

==========

Heartbeat cache: {[2 2] {:is-timed-out false, :nimbus-time 1434146475, 
:executor-reported-time 1434146480}, [3 3] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146480}, [4 4] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146479}, [5 5] 
{:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 
1434146480}, [6 6] {:is-timed-out true, :nimbus-time 1434146355, 
:executor-reported-time 1434146250}, [7 7] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146480}, [8 8] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146480}, [9 9] 
{:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 
1434146479}, [10 10] {:is-timed-out false, :nimbus-time 1434146475, 
:executor-reported-time 1434146480}, [11 11] {:is-timed-out true, :nimbus-time 
1434146355, :executor-reported-time 1434146250}, [12 12] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146480}, [13 13] 
{:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 
1434146480}, [14 14] {:is-timed-out false, :nimbus-time 1434146475, 
:executor-reported-time 1434146479}, [15 15] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146480}, [16 16] {:is-timed-out true, 
:nimbus-time 1434146355, :executor-reported-time 1434146250}, [17 17] 
{:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 
1434146480}, [18 18] {:is-timed-out false, :nimbus-time 1434146475, 
:executor-reported-time 1434146480}, [19 19] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146479}, [20 20] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146480}, [21 21] 
{:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 
1434146250}, [22 22] {:is-timed-out false, :nimbus-time 1434146475, 
:executor-reported-time 1434146480}, [23 23] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146480}, [24 24] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146479}, [25 25] 
{:is-timed-out false, :nimbus-time 1434146475, :executor-reported-time 
1434146480}, [26 26] {:is-timed-out true, :nimbus-time 1434146355, 
:executor-reported-time 1434146250}, [27 27] {:is-timed-out false, :nimbus-time 
1434146475, :executor-reported-time 1434146480}, [28 28] {:is-timed-out false, 
:nimbus-time 1434146475, :executor-reported-time 1434146480}, [1 1] 
{:is-timed-out true, :nimbus-time 1434146355, :executor-reported-time 
1434146250}}
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[6 6] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[11 11] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[16 16] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[21 21] not alive
2015-06-12T22:01:15.997+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[26 26] not alive
2015-06-12T22:01:15.998+0000 b.s.d.nimbus [INFO] Executor asyncVarGenTopology-1-1434145217:[1 1] not alive

On Fri, Jun 12, 2015 at 2:47 PM, Fang Chen <[email protected]> wrote:

supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3

task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1

some nimbus parameters:
nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.secs 30
When the worker dies, the log in one of the supervisors shows it shutting down 
the worker with a state of :disallowed (which I googled around; some people say 
it's due to a nimbus reassign). The other logs only show the worker shutting 
down, without any further information.

On Fri, Jun 12, 2015 at 2:09 AM, Erik Weathers <[email protected]> wrote:

I'll have to look later, I think we are using ZooKeeper v3.3.6 (something like 
that).  Some clusters have 3 ZK hosts, some 5.
The way nimbus detects that the executors are not alive is by not seeing their 
heartbeats updated in ZK.  There has to be some cause for the heartbeats not 
being updated.  The most likely one is that the worker process is dead.  Another 
could be that the process is too busy garbage collecting, and so missed the 
timeout for updating the heartbeat.
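The detection Erik describes is essentially a timestamp comparison, and you can 
see its inputs in the "Heartbeat cache" dump earlier in this thread. A 
simplified sketch of the rule (not Storm's actual Clojure code, which also 
handles launch grace periods) might look like:

```python
def is_timed_out(nimbus_time_secs, executor_reported_time_secs, timeout_secs):
    """An executor is considered dead when its last reported heartbeat is
    older than the timeout, measured on nimbus's clock."""
    return (nimbus_time_secs - executor_reported_time_secs) > timeout_secs
```

With nimbus.task.timeout.secs 30, the dead entries in the cache dump 
(nimbus-time 1434146355, executor-reported-time 1434146250, a 105-second gap) 
trip this check, while live entries (executor time slightly ahead of nimbus 
time) do not.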
Regarding Supervisor and Worker: I think it's normal for the worker to be able 
to live absent the presence of the supervisor, so that sounds like expected 
behavior.
What are your timeouts for the various heartbeats?
Also, when the worker dies you should see a log from the supervisor noticing it.

- Erik

On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

Hi Erik
Thanks for your reply!  It's great to hear about real production usage. For 
our use case, we are really puzzled by the outcome so far. The initial 
investigation seems to indicate that workers don't die by themselves (I 
actually tried killing the supervisor, and the worker continued running 
beyond 30 minutes).  
The sequence of events is like this:  the supervisor immediately complains that 
the worker "still has not started" for a few seconds right after launching the 
worker process, then goes silent --> after 26 minutes, nimbus complains that 
executors (related to the worker) are "not alive" and starts to reassign the 
topology --> after another ~500 milliseconds, the supervisor shuts down its 
worker --> other peer workers complain about netty issues, and the loop goes on.
Could you kindly tell me what version of zookeeper is used with 0.9.4, and how 
many nodes are in the zookeeper cluster? 
I wonder if this is due to zookeeper issues.
Thanks a lot,
Fang
 
On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <[email protected]> wrote:

Hey Fang,
Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm 0.9.4 
(with netty) at scale, in clusters on the order of 30+ nodes.
One of the challenges with storm is figuring out what the root cause is when 
things go haywire.  You'll wanna examine why the nimbus decided to restart your 
worker processes.  It would happen when workers die and the nimbus notices that 
storm executors aren't alive.  (There are logs in nimbus for this.)  Then 
you'll wanna dig into why the workers died by looking at logs on the worker 
hosts.
- Erik

On Thursday, June 11, 2015, Fang Chen <[email protected]> wrote:

We have been testing storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 
yet, but I don't see any significant differences there), and unfortunately we 
could not even get a clean run of over 30 minutes on a cluster of 5 high-end 
nodes. zookeeper is also set up on these nodes, but on different disks.
I have had huge trouble getting my data analytics topology a stable run. So I 
tried the simplest topology I can think of: just an empty bolt, no I/O except 
for reading from a kafka queue. 
Just to report my latest testing on 0.9.4 with this empty bolt (kafka topic 
partition=1, spout task #=1, bolt #=20 with fields grouping, msg size=1k): 
after 26 minutes, nimbus orders the topology killed as it believes the topology 
is dead, then after another 2 minutes, another kill, then another after another 
4 minutes, and on and on. 

I can understand there might be issues in the coordination among nimbus, 
workers, and executors (e.g., heartbeats). But are there any doable 
workarounds? I hope there are, since so many of you are using it in production :-)
I deeply appreciate any suggestions that could make even my toy topology 
work!
Fang













  
