I am testing failure recovery for a topology and I am seeing an issue where a task does not receive any new tuples after it has been recovered.
The topology has 3 bolts, with 1 task each at this point. Initially I was using the default Storm configuration with 4 workers, so each task was running in its own worker. When the 3rd task failed, I saw the following sequence:

- Supervisor notices that the task has failed and restarts the worker (supervisor.worker.timeout.secs = 30)
- Whilst the worker is restarting, nimbus notices that the task has failed (nimbus.task.timeout.secs = 30), so it sets its status to disallowed and allocates the task to another worker
- Supervisor kills the worker that is still restarting (as its status has been set to disallowed)
- Supervisor allocates the failed task to the worker in which the task for bolt 2 is running
- Supervisor restarts the failed worker
- When the failed worker is restarted, nimbus notices and reallocates the failed task back to it
- Supervisor kills the tasks for bolts 2 and 3 and assigns them back to their original workers

The fact that the failed task is restarted multiple times, and that the task for bolt 2 is restarted even though it didn't fail, are both big problems for what I am trying to achieve. When the task for bolt 3 fails, I would like it to be restarted on the same worker, even if that means waiting for a little while if the worker needs to be restarted.

I tried changing nimbus.task.timeout.secs to 120 seconds. Now I can see that the supervisor notices that the task has failed and restarts it without any problem, and that nimbus does not set the worker status to disallowed or allocate the task elsewhere. However, when the task comes back up again, it doesn't receive any tuples. I can see in the logs that tuples are being emitted (on both direct and non-direct streams) from the task for bolt 2, but they are never received at task 3 after it has been recovered. Also note that I am not using Storm's acking functionality, so these lost tuples are not retried from the spout.

Does anyone have any ideas what the issue might be?
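As a rough illustration of the timing race in the first scenario, here is a minimal Python sketch. It is a toy model, not Storm code: the function name `nimbus_reassigns` and the 10-second worker restart time are hypothetical, and it assumes both timeouts are measured from the task's last heartbeat.

```python
def nimbus_reassigns(supervisor_timeout_secs, nimbus_timeout_secs, worker_restart_secs):
    """Toy model (an assumption, not Storm's actual logic).

    Returns True if nimbus would time the task out before the
    restarted worker is able to heartbeat again.
    """
    # The supervisor only begins the restart once its own timeout has expired.
    restart_begins = supervisor_timeout_secs
    # Earliest moment the restarted worker could report a heartbeat.
    first_heartbeat = restart_begins + worker_restart_secs
    # If nimbus's timeout has already expired by then, it reassigns the task.
    return first_heartbeat >= nimbus_timeout_secs


# Hypothetical restart time of 10 seconds, chosen only for illustration.
print(nimbus_reassigns(30, 30, 10))   # both timeouts at 30 s, as in the defaults above
print(nimbus_reassigns(30, 120, 10))  # nimbus.task.timeout.secs raised to 120
```

Under these assumptions, with both timeouts at 30 seconds nimbus always gives up on the task before the restarted worker can report in, which matches the reassignment described above. With the nimbus timeout at 120 seconds the worker has time to come back first.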
I was wondering if it could be something to do with the routing, or that ZooKeeper doesn't know about the task being restarted, but since there is nothing in the logs I'm at a loss as to how to progress. TIA
