We are on Storm 0.8.2 and have a fairly simple topology with one spout and 9 bolts on two worker nodes. The spout reads from an Azure storage queue via the Azure API, and the final bolt posts data to a REST API endpoint.
Things work great for a while, but over time messages are dropped between the bolts: the difference between the “Emitted” count of one bolt and the “Executed” count of the next bolt in the processing chain keeps growing, to the point where no messages make it all the way through the system. When we restart the topology, everything is fine again for a day or two.

This would indicate a resource issue, but the servers look very healthy (memory and CPU usage never go over 50%). We recently stopped acking messages, but that did not make a difference. We also tried switching to Storm 0.9.0.1 (with both Netty and 0MQ), but that did not help either.

We are pretty much out of ideas. What are we missing? Any suggestions are very much appreciated!

Thanks,
Markus
