I am testing failure recovery for a topology and I am seeing an issue where
a recovered task does not receive any new tuples after it has been
recovered.

The topology has 3 bolts, with 1 task each at this point.

Initially I was using the default Storm configuration with 4 workers, so
each task was running in its own worker. When the 3rd task failed, I saw
the following:
- Supervisor notices that the task has failed and restarts the worker
(supervisor.worker.timeout.secs=30)
- Whilst the worker is restarting, nimbus notices that the task has failed
(nimbus.task.timeout.secs=30), so it sets its status to disallowed and
allocates the task to another worker
- Supervisor kills the worker that is still restarting (as its status has
been set to disallowed)
- Supervisor allocates the failed task to the same worker in which the task
for bolt 2 is running
- Supervisor restarts the failed worker
- When the failed worker is restarted, nimbus notices and reallocates the
failed task back to it
- Supervisor kills the tasks for bolts 2 and 3 and assigns them back to
their original workers

The fact that the failed task is restarted multiple times, and that the
task for bolt 2 is restarted even though it didn't fail, are both big
problems for what I am trying to achieve.
When the task for bolt 3 fails, I would like it to be restarted on the same
worker, even if that means waiting for a little while if the worker needs
to be restarted.

I tried changing nimbus.task.timeout.secs to 120 seconds. Now I can see
that the supervisor notices that the task has failed and restarts it without
a problem, and that nimbus does not set the worker status to disallowed
or allocate the task elsewhere. However, when the task comes back up again,
it doesn't receive any tuples. I can see in the logs that tuples are being
emitted (on both direct and non-direct streams) from the task for bolt 2,
but they are never received at task 3 after it has been recovered. Also
note that I am not using Storm's acking functionality so these lost tuples
are not retried from the spout.
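
The timing race described above can be sketched as a small model. This is
only an illustration under assumptions, not Storm's actual scheduling logic:
it assumes both timeouts are measured from the worker's last heartbeat, and
restartSecs is a hypothetical figure for how long a worker restart takes.
The class and method names are made up for this example.

```java
// Simplified model of the supervisor/nimbus timeout race.
// NOT Storm code; names and the restart duration are assumptions.
final class TimeoutRace {
    /**
     * All times are seconds since the worker's last heartbeat.
     * Returns true if nimbus would time the task out before the
     * restarted worker is back up, i.e. the task gets reassigned.
     */
    static boolean nimbusReassigns(int supervisorTimeoutSecs,
                                   int nimbusTimeoutSecs,
                                   int restartSecs) {
        // Supervisor only starts the restart once its own timeout expires.
        int workerBackUpAt = supervisorTimeoutSecs + restartSecs;
        return workerBackUpAt > nimbusTimeoutSecs;
    }

    public static void main(String[] args) {
        int restartSecs = 10; // assumed restart duration
        System.out.println("nimbus timeout 30:  reassigned = "
                + nimbusReassigns(30, 30, restartSecs));
        System.out.println("nimbus timeout 120: reassigned = "
                + nimbusReassigns(30, 120, restartSecs));
    }
}
```

Under this model, equal timeouts mean that any non-zero restart time leads to
a reassignment, which matches the first scenario, while a nimbus timeout of
120 seconds leaves room for the worker to come back first.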

Does anyone have any ideas what the issue might be? I was wondering if it
could be something to do with the routing, or that zookeeper doesn't know
about the task being restarted, but since there is nothing in the logs I'm
at a loss as to how to progress.

TIA
