Re: Behaviour when a task in a running job fails
Hi Till, Awesome! Thanks a lot for your answer! Currently I don’t have much free time to help with the implementation/contribute to Flink, but for sure I will in a near future! Thanks again, Luís Alves On Wed, 09 Aug 2017 at 10:29 Till Rohrmann < mailto:Till Rohrmann> wrote: a, pre, code, a:link, body { word-wrap: break-word !important; } Hi Luis, Flink's default behaviour is to fail all tasks and then redeploying the tasks with a possible different task to slot mapping. However, we recently introduced the infrastructure for a more fine-grained recovery mechanism. With this infrastructure it will be possible to only restart depended tasks which are effectively the connected components wrt the job DAG. See FLINK-5869 for more information. Moreover, we are working on rescheduling tasks to slots which executed the task before (introducing a scheduling preference). That way we can benefit from state which is still kept on these machines instead of having to download it. So basically, what you are writing is correct but also not as easy as to implement. Some of the required functionality is already there and more will follow with the upcoming 1.4 release. if you are interested in helping out with the implementation then feel welcome! Cheers, Till On Wed, Aug 9, 2017 at 12:45 AM, Luis Alves < mailto:lmtjal...@gmail.com > wrote: > Hello, > > What happens when a task in a running job fails? Will all the current > executions of the job's tasks fail? Will all the slots being used by the > job tasks (failed and non-failed ones) be released. > > Assuming all the slots are released, wouldn’t it make sense to: > > 1. “stop” the non-failed tasks and keep them holding their slots. Where > “stop” may mean something like stop processing tuples from the input > streams. > > 2. Re-schedule the failed task to a new slot (imagine the task failed > because the Task Manager owning that slot failed) > > 3. Recover state to the last snapshot. > > 4. “re-start”. Where “re-start” means start processing tuples from the > input streams. > > Thanks, > > Luís Alves
Re: Behaviour when a task in a running job fails
Hi Luis, Flink's default behaviour is to fail all tasks and then redeploying the tasks with a possible different task to slot mapping. However, we recently introduced the infrastructure for a more fine-grained recovery mechanism. With this infrastructure it will be possible to only restart depended tasks which are effectively the connected components wrt the job DAG. See FLINK-5869 for more information. Moreover, we are working on rescheduling tasks to slots which executed the task before (introducing a scheduling preference). That way we can benefit from state which is still kept on these machines instead of having to download it. So basically, what you are writing is correct but also not as easy as to implement. Some of the required functionality is already there and more will follow with the upcoming 1.4 release. if you are interested in helping out with the implementation then feel welcome! Cheers, Till On Wed, Aug 9, 2017 at 12:45 AM, Luis Alveswrote: > Hello, > > What happens when a task in a running job fails? Will all the current > executions of the job's tasks fail? Will all the slots being used by the > job tasks (failed and non-failed ones) be released. > > Assuming all the slots are released, wouldn’t it make sense to: > > 1. “stop” the non-failed tasks and keep them holding their slots. Where > “stop” may mean something like stop processing tuples from the input > streams. > > 2. Re-schedule the failed task to a new slot (imagine the task failed > because the Task Manager owning that slot failed) > > 3. Recover state to the last snapshot. > > 4. “re-start”. Where “re-start” means start processing tuples from the > input streams. > > Thanks, > > Luís Alves
Behaviour when a task in a running job fails
Hello, What happens when a task in a running job fails? Will all the current executions of the job's tasks fail? Will all the slots being used by the job tasks (failed and non-failed ones) be released. Assuming all the slots are released, wouldn’t it make sense to: 1. “stop” the non-failed tasks and keep them holding their slots. Where “stop” may mean something like stop processing tuples from the input streams. 2. Re-schedule the failed task to a new slot (imagine the task failed because the Task Manager owning that slot failed) 3. Recover state to the last snapshot. 4. “re-start”. Where “re-start” means start processing tuples from the input streams. Thanks, Luís Alves