I have created a very simple topology to demonstrate an issue we have been seeing for months: we appear to leak messages consistently after force-killing a worker in test scenarios. This is reproducible for us every time, as long as there are multiple workers in play and I kill a worker that is executing bolts but not spouts. The effect is a pause in throughput, followed by a consistent percentage of spout failures due to timeout; throughput stays depressed while those messages wait to time out, and then the cycle repeats indefinitely. The strange part is that the failed messages eventually replay and succeed, almost as if the replayed message takes a different, non-leaky path. This is infamously known here as the "glug effect".
I am fairly sure it has something to do with configuration, or else it could be a bug in Storm. I was hoping the issue was contention in the Disruptor queue (https://issues.apache.org/jira/browse/STORM-342), but alas that patch did not help. I am happy to explain the choices behind the configuration values below; the large timeouts are mainly because in production some of our bolts have high execute latency and take a while to initialize.

Here is the simple topology code to run, which emits tuples constantly to a consumer bolt (a rough sketch of its shape follows my signature):
https://gist.github.com/anonymous/9f4e0ae972d9f1f4e7bf

## Storm configuration

nimbus.task.launch.secs: 300
nimbus.task.timeout.secs: 300
supervisor.monitor.frequency.secs: 15
supervisor.worker.start.timeout.secs: 300
supervisor.worker.timeout.secs: 120
worker.heartbeat.frequency.secs: 15
storm.zookeeper.session.timeout: 1000000
storm.messaging.netty.max_retries: 30
storm.messaging.netty.max_wait_ms: 20000

## End Configuration

Thank you for your time.

Luke Forehand | Networked Insights | Software Engineer
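P.S. In case the gist link goes stale, here is a rough sketch of the shape of the topology. The gist is the authoritative version; the class names, parallelism hints, and topology name here are illustrative only, and the API shown is the pre-1.0 backtype.storm package we are running.

    import java.util.Map;
    import java.util.UUID;

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class GlugTopology {

        // Spout that emits anchored tuples as fast as nextTuple() is called.
        public static class ConstantSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                // Emit with a message ID so Storm tracks the tuple and reports
                // timeouts back to the spout as failures.
                collector.emit(new Values("payload"), UUID.randomUUID().toString());
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Consumer bolt that does no work other than acking each tuple.
        public static class ConsumerBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple input) {
                collector.ack(input);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // No downstream output.
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new ConstantSpout(), 1);
            builder.setBolt("consumer", new ConsumerBolt(), 4).shuffleGrouping("spout");

            Config conf = new Config();
            // Multiple workers, so at least one worker runs bolts but no spout
            // and can be force-killed to reproduce the issue.
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("glug-test", conf, builder.createTopology());
        }
    }

To reproduce, I submit this with the configuration above, find the worker that is running only consumer bolt executors, and kill -9 it.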
