I have created a very simple topology to demonstrate an issue we have been seeing for months: we appear to leak messages consistently after force-killing a worker in test scenarios. This is reproducible for us every time, as long as there are multiple workers in play and I kill a worker that is executing bolts but not spouts. The effect is a pause in throughput, followed by a consistent percentage of spout failures due to timeout; throughput stays depressed while those messages wait to time out, and then the cycle repeats indefinitely. The strange part is that the failed messages eventually replay and succeed, almost as if the replayed message takes a different, non-leaky path. This is infamously known here as the "glug effect".
I am fairly sure it has something to do with configuration, or else it could be a bug in Storm. I was hoping the issue was contention in the Disruptor queue (https://issues.apache.org/jira/browse/STORM-342), but alas that patch did not help. I am happy to explain the choices behind the configuration values below; the large timeouts are mainly because in production some of our bolts have high execute latency and take a while to initialize.

Here is the simple topology code to run, which emits tuples constantly to a consumer bolt (a rough sketch of its shape follows my signature):
https://gist.github.com/anonymous/9f4e0ae972d9f1f4e7bf

## Storm configuration

nimbus.task.launch.secs: 300
nimbus.task.timeout.secs: 300
supervisor.monitor.frequency.secs: 15
supervisor.worker.start.timeout.secs: 300
supervisor.worker.timeout.secs: 120
worker.heartbeat.frequency.secs: 15
storm.zookeeper.session.timeout: 1000000
storm.messaging.netty.max_retries: 30
storm.messaging.netty.max_wait_ms: 20000

## End Configuration

Thank you for your time.

Luke Forehand | Networked Insights | Software Engineer
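P.S. In case the gist link goes stale, here is a rough sketch of the shape of the topology. The gist is the authoritative version; the class names, parallelism hints, and topology name here are illustrative only, and the API shown is the pre-1.0 backtype.storm package we are running.

    import java.util.Map;
    import java.util.UUID;

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class GlugTopology {

        // Spout that emits anchored tuples as fast as nextTuple() is called.
        public static class ConstantSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                // Emit with a message ID so Storm tracks the tuple and reports
                // timeouts back to the spout as failures.
                collector.emit(new Values("payload"), UUID.randomUUID().toString());
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Consumer bolt that does no work other than acking each tuple.
        public static class ConsumerBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple input) {
                collector.ack(input);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // No downstream output.
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new ConstantSpout(), 1);
            builder.setBolt("consumer", new ConsumerBolt(), 4).shuffleGrouping("spout");

            Config conf = new Config();
            // Multiple workers, so at least one worker runs bolts but no spout
            // and can be force-killed to reproduce the issue.
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("glug-test", conf, builder.createTopology());
        }
    }

To reproduce, I submit this with the configuration above, find the worker that is running only consumer bolt executors, and kill -9 it.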
