We are having an issue with our Storm topology. The topology runs fine for some time with a high ack rate (above 10000k/10 mins), then the ack rate drops to below 1000/10 mins and messages start failing because they reach the topology timeout. The Storm logs contain no exceptions or other information about what is going on.
Judging by the logging from our bolts, every message that fails reaches a certain bolt and is then not processed by any of the bolts that follow. The bolt being emitted to at that point is the first one in the topology to use a FieldsGrouping. Our setup is a 3-node cluster running a single topology. max_spout_pending is set to a sensible value for our setup.
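Since the stall starts at the first bolt fed by a FieldsGrouping, one thing that may be worth ruling out is key skew: a fields grouping sends every tuple with the same field value to the same task, so a few very frequent values could overload one task while the others sit idle. The sketch below is not Storm code. It is a stand-alone Python approximation that could be run over a sample of the grouping field values taken from our logs. The function names (task_for_key, task_load, skew_ratio) are made up for illustration, and CRC32 is only a stand-in for whatever hash Storm actually applies, so the exact task numbers will not match the real cluster.

```python
import zlib


def task_for_key(key: str, num_tasks: int) -> int:
    """Approximate a fields grouping: same key always maps to the same task.

    Assumption: CRC32 stands in for Storm's real hash of the grouping fields.
    """
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def task_load(keys, num_tasks: int) -> dict:
    """Count how many tuples each task would receive for the sampled keys."""
    load = {task: 0 for task in range(num_tasks)}
    for key in keys:
        load[task_for_key(key, num_tasks)] += 1
    return load


def skew_ratio(load: dict) -> float:
    """Busiest task's tuple count divided by the mean count per task.

    A value near 1.0 means an even spread; a value near the number of
    tasks means almost everything lands on a single task.
    """
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean if mean else 0.0


if __name__ == "__main__":
    # Hypothetical sample: one hot key dominating the stream.
    sample = ["hot-key"] * 900 + [f"key-{i}" for i in range(100)]
    load = task_load(sample, num_tasks=4)
    print(load, skew_ratio(load))
```

If the ratio for a real sample comes out close to the number of tasks on that bolt, the slowdown might simply be one task falling behind until tuples hit the topology timeout. If it stays close to 1.0, skew is probably not the cause.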
