Re: Fine Grained Recovery / FLIP-1

2019-07-26 Thread Thomas Weise
Guowei and Stephan, thanks for the reply! The biggest gain that FLIP-1 will deliver for streaming is that parallel processing can continue except for those parallel paths affected by the failure, even when all tasks in an affected path need to be reset. Assuming task manager process failure as mos

Re: Fine Grained Recovery / FLIP-1

2019-07-26 Thread Stephan Ewen
Hi Thomas! For batch, this should be working in release 1.9. For streaming, it is a bit more tricky, mainly because you have to deal with downstream correctness. Either a recovery still needs to reset downstream tasks (which means on average half of the tasks) or needs to wait be

Re: Fine Grained Recovery / FLIP-1

2019-07-25 Thread Guowei Ma
Hi, 1. Currently, much of the work in FLINK-4256 is about failover improvements in the bounded dataset scenario. 2. For the streaming scenario, a new shuffle plugin + proper failover strategy could avoid the "stop-the-world" recovery. 3. We have already done a lot of work on the new shuffle in the old Flin

Fine Grained Recovery / FLIP-1

2019-07-25 Thread Thomas Weise
Hi, We are using Flink for streaming and find the "stop-the-world" recovery behavior of Flink prohibitive for use cases that prioritize availability. Partial recovery as outlined in FLIP-1 would probably alleviate these concerns. https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+G