[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-19 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910534#comment-16910534
 ] 

Stephan Ewen commented on FLINK-4256:
-

@Thomas - I think the approach to manually split jobs with a pubsub is an 
option. It could be interesting to add some tooling for that.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-18 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910168#comment-16910168
 ] 

Thomas Weise commented on FLINK-4256:
-

Thanks for the clarification, this is excellent news. Perhaps clarify that on 
FLIP-1? Also, until the even finer grained recovery of streaming jobs becomes 
available, it may be possible for users to decompose a pipeline into smaller 
segments with intermediate pubsub topics if partial availability across a 
shuffle step is needed.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-18 Thread Stephan Ewen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909980#comment-16909980
 ] 

Stephan Ewen commented on FLINK-4256:
-

This is in fact working for streaming as well, not only for batch. It works for 
both on the granularity of "pipelined regions".

However, with blocking "batch" shuffles, a batch job decomposes into many small 
pipelined regions, which can be individually recovered. Streaming programs only 
decompose into multiple pipelined regions when they do not have an all-to-all 
shuffle ({{keyBy()}} or {{rebalance()}}).

Anything beyond that, like more fine grained recovery of streaming jobs is not 
in the scope here, because it would need a mechanism different from Flink's 
current checkpointing mechanism.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2019-08-16 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909157#comment-16909157
 ] 

Thomas Weise commented on FLINK-4256:
-

I'm surprised to see this closed when there is still no support for streaming? 
Please see 
[https://lists.apache.org/thread.html/cdb315f4b71a915c4c598b580f71cad11bc3f5bd146b916378765f2a@%3Cdev.flink.apache.org%3E]

 

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2017-11-10 Thread Sihua Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247272#comment-16247272
 ] 

Sihua Zhou commented on FLINK-4256:
---

Partial recovery seem to more useful for Batch job, for stream job is it also 
fine to place `local recovery` under this umbrella? [~StephanEwen]

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2017-09-13 Thread Eron Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165537#comment-16165537
 ] 

Eron Wright  commented on FLINK-4256:
-

Is the scope of FLINK-4256 covering batch jobs only?   There are various TODOs 
related to the interplay between local failover and checkpointing, for example. 
  Wondering whether additional subtasks should be opened here or elsewhere.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2017-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925615#comment-15925615
 ] 

ASF GitHub Bot commented on FLINK-4256:
---

GitHub user shuai-xu opened a pull request:

https://github.com/apache/flink/pull/3539

[FLINK-4256] Flip1: fine gained recovery

This is an informal pr for the implementation of flip1 version 1. 
It enable that when a task fail, only restart the minimal pipelined 
connected executions instead of the whole execution graph.
Main changes:
1. ExecutionGraph doesn't manage the failover any more, it only record the 
finished JobVertex number and turn to FINISHED when all vertexes finish(maybe 
later FailoverCoordinator will take over this). Its state can only be CREATED, 
RUNNING, FAILED, FINISHED or SUSPENDED now.
2. FailoverCoordinator will manage the failover now. It will generate 
several FailoverRegions when the EG is attached. It listens for the fail of 
executions. When an execution fail, it finds a FailoverRegion to finish the 
failover.
3. When JM need the EG to be canceled or failed, EG will also notice 
FailoverCoordinator, FailoverCoordinator will notice all FailoverRegions to 
cancel their executions and when all executions are canceled, 
FailoverCoordinator will notice EG to be CANCELED or FAILED.
4. FailoverCoordinator has server state, RUNNING, FAILING, CANCELLING, 
FAILED, CANCELED. 
5. FailoverRegion contains the minimal pipelined connected executions and 
manager the failover of them.
6. FailoverRegion has CREATED, RUNNING, CANCELLING, CANCELLED.
7. One FailoverRegion may be the succeeding or preceding of others. When a 
preceding region failover, its all succeedings should failover too. And the 
succeedings should just reset its executions and wait for the preceding to 
start it when preceding finish. Preceding should wait for its succeedings to be 
CREATED and then schedule again.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shuai-xu/flink jira-4256

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3539.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3539


commit 363f1536838064edbdd5f39e41f3f19f6c511fc4
Author: shuai.xus 
Date:   2017-03-15T03:36:11Z

[FLINK-4256] Flip1: fine gained recovery




> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detail desgin for version1 is 
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2016-08-09 Thread wenlong.lyu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413090#comment-15413090
 ] 

wenlong.lyu commented on FLINK-4256:


thanks for explaining, you are right about pre-computing. Still have another 
concern, I think it is quite a special case for a job to be ExecutionJobVertex 
level splittable, it may only happen in batch job graphs with blocking edges in 
practice.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2016-08-04 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407870#comment-15407870
 ] 

Stephan Ewen commented on FLINK-4256:
-

[~zjwang] Preventing downstream restarts would be a followup optimization.
In order to not make this issue here more complicated than it already is, I 
would first solve this, and then approach this as a separate followup.



> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2016-08-04 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407866#comment-15407866
 ] 

Stephan Ewen commented on FLINK-4256:
-

[~wenlong.lwl] True, one has to find the entire "connected component" for 
restart. That one is, however, dynamic, so I would not pre-compute it:
  - We may introduce best-effort caching that means in some cases the program 
must backtrack further, in others less
  - Downstream canceling is only necessary if an input has already been 
supplied to the downstream task. Especially in batch, this is often not the 
case and can reduce the tasks to look at for canceling.

We can make this quite a bit more efficient in my opinion by operating on 
{{ExecutionJobVertex}} level for many cases, rather than on each individual 
vertex.

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2016-08-03 Thread Zhijiang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405484#comment-15405484
 ] 

Zhijiang Wang commented on FLINK-4256:
--

In further improvement, if task c failed, the following downstream tasks like d 
and e should not be restarted. We already make some works related with it.
I am interested in the issue of caching intermediate result, it can solve the 
problem of restarting upstream tasks of failed one. Is it the 
PIPELINED_PERSISTENT type in result partition? Wish further plan for it.


> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4256) Fine-grained recovery

2016-07-25 Thread Wenlong Lyu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391849#comment-15391849
 ] 

Wenlong Lyu commented on FLINK-4256:


hi, Stephan, we have implemented similar solution before, but simple back 
tracking and forward cannot work well in situation following:

Assuming job graph has A/B/C job vertices, A is connected to C with forward 
strategy, and B is connected to C all-to-all strategy, when a task of A failed, 
only one C task will be added to restart node set.

I suggesting divide the job graph in maximal connected sub-graphs treating the 
job graph as an undirected graph, when a job graph is submitted. Besides, when 
the job graph is in large scale because extracting related nodes according to a 
given node can be time costly and will be repeatedly used in long running, 
using sub-graphs can avoid the problem 

> Fine-grained recovery
> -
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
> Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)