Re: Behaviour when a task in a running job fails

2017-08-09 Thread Luis Alves
Hi Till,

Awesome! Thanks a lot for your answer! 

Currently I don’t have much free time to help with the
implementation or contribute to Flink, but I certainly will in the near future!

Thanks again, 

Luís Alves

On Wed, 09 Aug 2017 at 10:29 Till Rohrmann wrote:


> Hi Luis,
>
> Flink's default behaviour is to fail all tasks and then redeploy them,
> possibly with a different task-to-slot mapping.
>
> However, we recently introduced the infrastructure for a more fine-grained
> recovery mechanism. With this infrastructure it will be possible to
> restart only the dependent tasks, which are effectively the connected
> components w.r.t. the job DAG. See FLINK-5869 for more information.
>
> Moreover, we are working on rescheduling tasks to the slots which executed
> them before (introducing a scheduling preference). That way we can benefit
> from state which is still kept on those machines instead of having to
> download it.
>
> So basically, what you are describing is correct, but it is not as easy to
> implement. Some of the required functionality is already there, and more
> will follow with the upcoming 1.4 release. If you are interested in
> helping out with the implementation, feel welcome!
>
> Cheers,
> Till
>
> On Wed, Aug 9, 2017 at 12:45 AM, Luis Alves <lmtjal...@gmail.com> wrote:
>
> > Hello,
> >
> > What happens when a task in a running job fails? Will all the current
> > executions of the job's tasks fail? Will all the slots being used by
> > the job's tasks (failed and non-failed ones) be released?
> >
> > Assuming all the slots are released, wouldn’t it make sense to:
> >
> > 1. “Stop” the non-failed tasks and keep them holding their slots, where
> > “stop” may mean something like stop processing tuples from the input
> > streams.
> >
> > 2. Re-schedule the failed task to a new slot (imagine the task failed
> > because the TaskManager owning that slot failed).
> >
> > 3. Recover state from the last snapshot.
> >
> > 4. “Re-start”, where “re-start” means start processing tuples from the
> > input streams.
> >
> > Thanks,
> >
> > Luís Alves

Re: Behaviour when a task in a running job fails

2017-08-09 Thread Till Rohrmann
Hi Luis,

Flink's default behaviour is to fail all tasks and then redeploy them,
possibly with a different task-to-slot mapping.
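
For context, this whole-job restart is driven by the job's restart strategy.
A minimal sketch, assuming the Flink 1.x DataStream API; the attempt count
and delay are illustrative values, not a recommendation:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartConfig {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // On failure, restart the whole job up to 3 times,
            // waiting 10 seconds between attempts (illustrative values).
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
        }
    }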

However, we recently introduced the infrastructure for a more fine-grained
recovery mechanism. With this infrastructure it will be possible to restart
only the dependent tasks, which are effectively the connected components
w.r.t. the job DAG. See FLINK-5869 for more information.
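
To make the connected-components idea concrete, here is a small,
self-contained sketch (hypothetical names, not Flink internals): treating
the job DAG as an undirected graph, the tasks that must restart together
with a failed task are exactly its connected component:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical illustration, not Flink code: the job DAG as an undirected
    // adjacency list; the component containing the failed task must restart.
    public class RestartRegion {
        static Set<String> tasksToRestart(Map<String, List<String>> edges, String failed) {
            Set<String> component = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(failed);
            while (!stack.isEmpty()) {
                String task = stack.pop();
                if (component.add(task)) {
                    // Visit every task connected to this one.
                    for (String next : edges.getOrDefault(task, List.of())) {
                        stack.push(next);
                    }
                }
            }
            return component;
        }

        public static void main(String[] args) {
            // Two independent pipelines: a -> b and c -> d.
            Map<String, List<String>> edges = Map.of(
                    "a", List.of("b"), "b", List.of("a"),
                    "c", List.of("d"), "d", List.of("c"));
            // Prints a and b (in some order); c and d keep running.
            System.out.println(tasksToRestart(edges, "a"));
        }
    }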

Moreover, we are working on rescheduling tasks to the slots which executed
them before (introducing a scheduling preference). That way we can benefit
from state which is still kept on those machines instead of having to
download it.
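
As a toy sketch of that scheduling preference (all names hypothetical, not
Flink's actual scheduler): try the task's previous slot first, and only fall
back to an arbitrary free slot, accepting a state download in that case:

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: prefer the slot that ran the task before, so state
    // still on that machine can be reused locally. Assumes at least one free slot.
    public class PreferredSlotChooser {
        static String chooseSlot(String task,
                                 Map<String, String> lastAssignment,
                                 Set<String> freeSlots) {
            String previous = lastAssignment.get(task);
            if (previous != null && freeSlots.contains(previous)) {
                return previous; // reuse local state, no download needed
            }
            // Any other free slot works, but state must be fetched remotely.
            return freeSlots.iterator().next();
        }
    }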

So basically, what you are describing is correct, but it is not as easy to
implement. Some of the required functionality is already there, and more
will follow with the upcoming 1.4 release. If you are interested in helping
out with the implementation, feel welcome!

Cheers,
Till

On Wed, Aug 9, 2017 at 12:45 AM, Luis Alves <lmtjal...@gmail.com> wrote:

> Hello,
>
> What happens when a task in a running job fails? Will all the current
> executions of the job's tasks fail? Will all the slots being used by the
> job's tasks (failed and non-failed ones) be released?
>
> Assuming all the slots are released, wouldn’t it make sense to:
>
> 1. “Stop” the non-failed tasks and keep them holding their slots, where
> “stop” may mean something like stop processing tuples from the input
> streams.
>
> 2. Re-schedule the failed task to a new slot (imagine the task failed
> because the TaskManager owning that slot failed).
>
> 3. Recover state from the last snapshot.
>
> 4. “Re-start”, where “re-start” means start processing tuples from the
> input streams.
>
> Thanks,
>
> Luís Alves


Behaviour when a task in a running job fails

2017-08-08 Thread Luis Alves
Hello,

What happens when a task in a running job fails? Will all the current
executions of the job's tasks fail? Will all the slots being used by the job's
tasks (failed and non-failed ones) be released?

Assuming all the slots are released, wouldn’t it make sense to (see the sketch after this list):

1. “Stop” the non-failed tasks and keep them holding their slots, where “stop”
may mean something like stop processing tuples from the input streams.

2. Re-schedule the failed task to a new slot (imagine the task failed because
the TaskManager owning that slot failed).

3. Recover state from the last snapshot.

4. “Re-start”, where “re-start” means start processing tuples from the input
streams.
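
Restating steps 1–4 as a sketch (all interfaces invented for illustration;
this mirrors the proposal above, not Flink's actual behaviour):

    import java.util.List;

    // Invented interfaces, for illustration only.
    interface Task {
        void stopConsumingInput();      // step 1: pause, but keep the slot
        void restoreState(Snapshot s);  // step 3: roll back to last snapshot
        void resumeConsumingInput();    // step 4: read input streams again
    }

    interface Scheduler {
        Task rescheduleToNewSlot(Task failed); // step 2
    }

    final class Snapshot {}

    class ProposedRecovery {
        static void recover(List<Task> healthyTasks, Task failedTask,
                            Scheduler scheduler, Snapshot lastSnapshot) {
            // 1. "Stop" the non-failed tasks; they keep holding their slots.
            healthyTasks.forEach(Task::stopConsumingInput);

            // 2. Re-schedule the failed task to a new slot.
            Task replacement = scheduler.rescheduleToNewSlot(failedTask);

            // 3. Recover state from the last snapshot, on all tasks.
            replacement.restoreState(lastSnapshot);
            healthyTasks.forEach(t -> t.restoreState(lastSnapshot));

            // 4. "Re-start": resume processing tuples from the input streams.
            healthyTasks.forEach(Task::resumeConsumingInput);
            replacement.resumeConsumingInput();
        }
    }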

Thanks, 

Luís Alves