Guowei and Stephan, thanks for the reply!
The biggest gain that FLIP-1 will deliver for streaming is that parallel
processing can continue except for those parallel paths affected by the
failure, even when all tasks in an affected path need to be reset. Assuming
task manager process failure as mos
Hi Thomas!
For Batch, this should be working in release 1.9.
For streaming, it is a bit more tricky, mainly because you have to deal with
downstream correctness.
Either a recovery still needs to reset downstream tasks (which means on
average half of the tasks) or needs to wait be
Hi,
1. Currently, much work in FLINK-4256 is about failover improvements in the
bounded dataset scenario.
2. For the streaming scenario, a new shuffle plugin + proper failover
strategy could avoid the "stop-the-world" recovery.
3. We have already done much work on the new shuffle in the old Flin
Hi,
We are using Flink for streaming and find the "stop-the-world" recovery
behavior of Flink prohibitive for use cases that prioritize availability.
Partial recovery as outlined in FLIP-1 would probably alleviate these
concerns.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+G