[ https://issues.apache.org/jira/browse/FLINK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephan Ewen resolved FLINK-1668.
---------------------------------
    Resolution: Implemented

Implemented in abbb0a93ca67da17197dc5372e6d95edd8149d44

> Add a config option to specify delays between restarts
> ------------------------------------------------------
>
>                 Key: FLINK-1668
>                 URL: https://issues.apache.org/jira/browse/FLINK-1668
>             Project: Flink
>          Issue Type: Improvement
>   Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 0.9
>
>
> The system currently introduces a short delay between a failed task execution
> and the restarted execution.
> The reason is that this delay seemed to help surface the problems that led to
> the failed task. As an example, if a TaskManager fails, tasks fail due to
> data transfer errors. The TaskManager is not immediately recognized as
> failed, though (it takes a while until the heartbeats time out). Immediately
> re-deploying tasks has a very high chance of assigning work to the
> TaskManager that is actually not responding, causing the execution retry to
> fail again. The delay gives the system time to figure out that the
> TaskManager was lost, so that it is not considered for the retry.
> Currently, the system uses the heartbeat timeout as the default delay value.
> This may make sense as a default value for critical task failures, but is
> actually quite high for other types of failures.
> In any case, I would like to add an option for users to specify the delay
> (even set it to 0, if desired).
> In my opinion, the delay is not the best solution, and we should eventually
> move to something better. One idea is to put TaskManagers responsible for
> failed tasks in a "probationary" mode until they have reported back that
> everything is good (still alive, disk space available, etc.).
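
For illustration only, here is a minimal Java sketch of the behavior the issue asks for: a user-configurable restart delay that may be 0 and that falls back to the heartbeat timeout when unset. It is not the code from the referenced commit. The class name, the key name `restart.delay.ms`, and the method `resolveRestartDelayMs` are hypothetical and were chosen for this example; they are not Flink identifiers.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch under stated assumptions; names below are hypothetical, not Flink's.
final class RestartDelaySketch {

    // Hypothetical configuration key, used only in this example.
    static final String RESTART_DELAY_KEY = "restart.delay.ms";

    // Returns the delay to wait before restarting a failed execution.
    // If the user set no value, the heartbeat timeout is used as the default,
    // as the issue describes. An explicit 0 is allowed and means "no delay".
    static long resolveRestartDelayMs(Map<String, String> config, long heartbeatTimeoutMs) {
        String raw = config.get(RESTART_DELAY_KEY);
        if (raw == null) {
            return heartbeatTimeoutMs;
        }
        long delayMs = Long.parseLong(raw.trim());
        if (delayMs < 0) {
            throw new IllegalArgumentException(
                    RESTART_DELAY_KEY + " must be >= 0, but was " + delayMs);
        }
        return delayMs;
    }

    public static void main(String[] args) {
        long heartbeatTimeoutMs = 100_000L; // assumed value for the demo

        Map<String, String> config = new HashMap<>();
        // No user setting: falls back to the heartbeat timeout.
        System.out.println(resolveRestartDelayMs(config, heartbeatTimeoutMs));

        // User explicitly disables the delay.
        config.put(RESTART_DELAY_KEY, "0");
        System.out.println(resolveRestartDelayMs(config, heartbeatTimeoutMs));
    }
}
```

Treating an unset value differently from an explicit 0 matters here: the default stays conservative for TaskManager failures, while users whose failures are of other kinds can opt out of the wait entirely.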