Thank you, Till! Is the current (in-progress) implementation also considering the problem of losing the task slots of the failed node(s), i.e. something related to [2]?
[2] https://issues.apache.org/jira/browse/FLINK-3047

Best,
Ovidiu

> On 22 Feb 2016, at 18:13, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Hi Ovidiu,
>
> at the moment Flink's batch fault tolerance restarts the whole job in case of
> a failure. However, parts of the logic to do partial backtracking such as
> intermediate result partitions and the backtracking algorithm are already
> implemented or exist as a PR [1]. So we hope to complete the partial
> backtracking soon.
>
> [1] https://github.com/apache/flink/pull/640
>
> Cheers,
> Till
>
> On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU
> <ovidiu-cristian.ma...@inria.fr> wrote:
> Hi
>
> In case of failure of a node what does it mean 'Fault tolerance for programs
> in the DataSet API works by retrying failed executions' [1] ?
> - work already done by the rest of the nodes is not lost, only work of the
>   lost node is recomputed, job execution will continue
> or
> - entire job execution is retried
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html
>
> Best,
> Ovidiu