[jira] [Resolved] (FLINK-1668) Add a config option to specify delays between restarts

Stephan Ewen (JIRA) Tue, 10 Mar 2015 01:47:06 -0700

     [ 
https://issues.apache.org/jira/browse/FLINK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stephan Ewen resolved FLINK-1668.
---------------------------------
    Resolution: Implemented

Implemented in abbb0a93ca67da17197dc5372e6d95edd8149d44

> Add a config option to specify delays between restarts
> ------------------------------------------------------
>
>                 Key: FLINK-1668
>                 URL: https://issues.apache.org/jira/browse/FLINK-1668
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 0.9
>
>
> The system currently introduces a short delay between a failed task execution 
> and the restarted execution.
> The reason is that this delay seemed to help the problems that led to the 
> failed task surface. As an example, if a TaskManager fails, tasks fail due 
> to data transfer errors. The TaskManager is not immediately recognized as 
> failed, though (it takes a while until heartbeats time out). Immediately 
> re-deploying tasks has a very high chance of assigning work to the 
> TaskManager that is actually not responding, causing the execution retry to 
> fail again. The delay gives the system time to figure out that the 
> TaskManager was lost, so that it is no longer considered for the retry.
> Currently, the system uses the heartbeat timeout as the default delay value. 
> This may make sense as a default value for critical task failures, but is 
> actually quite high for other types of failures.
> In any case, I would like to add an option for users to specify the delay 
> (even set it to 0, if desired).
> The delay is not the best solution, in my opinion; we should eventually move 
> to something better. One idea is to put TaskManagers responsible for failed 
> tasks in a "probationary" mode until they have reported back that everything 
> is good (still alive, disk space available, etc.).
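
A minimal sketch of the option described above, assuming the delay is read from a plain key/value configuration: use the user-specified value if present (0 is allowed), otherwise fall back to the heartbeat timeout. The class name, the method, and the key `restart.delay.ms` are hypothetical and chosen only for illustration; they are not taken from the actual commit.

```java
import java.util.HashMap;
import java.util.Map;

public class RestartDelaySketch {

    // Hypothetical configuration key, for illustration only.
    static final String DELAY_KEY = "restart.delay.ms";

    /**
     * Returns the delay to wait before restarting a failed execution.
     * Falls back to the heartbeat timeout when the user configured nothing.
     */
    static long resolveDelayMillis(Map<String, String> config, long heartbeatTimeoutMillis) {
        String raw = config.get(DELAY_KEY);
        if (raw == null) {
            // No user setting: keep the current default behavior.
            return heartbeatTimeoutMillis;
        }
        long delay = Long.parseLong(raw.trim());
        if (delay < 0) {
            throw new IllegalArgumentException("restart delay must be >= 0, got " + delay);
        }
        // A value of 0 means "restart immediately".
        return delay;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, String> config = new HashMap<>();
        config.put(DELAY_KEY, "0");

        long delay = resolveDelayMillis(config, 100_000L);
        Thread.sleep(delay); // wait before re-deploying the failed tasks
        System.out.println("restarting after " + delay + " ms");
    }
}
```

Rejecting negative values while accepting 0 keeps the "no delay" case explicit instead of overloading a sentinel value.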



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
