Thanks, Yangyu, for launching this discussion.

I really like this proposal. In production, we have frequently seen a few
long-tail tasks delay the total execution time of a batch job.
We also have some thoughts on introducing this mechanism. Looking forward to your
detailed design doc, so that we can discuss it further.

Best,
Zhijiang
------------------------------------------------------------------
From: Tao Yangyu <ryantao...@gmail.com>
Sent: Tuesday, November 6, 2018, 11:01
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] Task speculative execution for Flink batch

Hi everyone,

We would like to propose speculative execution of tasks for Flink batch jobs,
as outlined below.

In batch mode, a job is usually divided into multiple parallel tasks
executed across many nodes in the cluster. It is common to encounter
performance degradation on some nodes due to hardware problems or to
incidental I/O contention and high CPU load. Such degradation can make the
tasks running on the affected node very slow; these are the so-called
long-tail tasks. Although long-tail tasks do not fail, they can severely
affect the total job running time. The Flink task scheduler does not
currently take this long-tail problem into account.



Here we propose a speculative execution strategy to handle the problem.
The basic idea is to run a copy of a task on another node when the original
task is identified as a long-tail task. In more detail, a speculative task
is triggered when the scheduler detects that the data processing
throughput of a task is much lower than that of the others. The speculative
task is executed in parallel with the original one and shares the same
failure retry mechanism. Once either task completes, the scheduler accepts
its output as the final result and cancels the other one, which is still
running. Preliminary experiments have demonstrated the effectiveness of this
approach.
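
To make the detection step concrete, here is a minimal sketch of how a
scheduler might flag long-tail tasks by comparing throughput. It is only an
illustration under assumptions: the class LongTailDetector, the method
findLongTailTasks, the use of the median as the baseline, and the slowRatio
parameter are all hypothetical and are not part of Flink or of the design
described here, which only says that a task's throughput must be "much
lower" than that of the others.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink's scheduler. */
final class LongTailDetector {

    private LongTailDetector() {}

    /**
     * Returns the IDs of tasks whose throughput is below slowRatio times the
     * median throughput of all tasks. Both the median baseline and slowRatio
     * are assumptions made for this sketch.
     */
    static List<String> findLongTailTasks(Map<String, Double> recordsPerSecond,
                                          double slowRatio) {
        List<String> slow = new ArrayList<>();
        if (recordsPerSecond.size() < 2) {
            return slow; // a single task has nothing to be compared against
        }

        // Compute the median throughput across all parallel tasks.
        List<Double> sorted = new ArrayList<>(recordsPerSecond.values());
        Collections.sort(sorted);
        int n = sorted.size();
        double median = (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;

        // A task is a speculation candidate if it is far below the median.
        for (Map.Entry<String, Double> e : recordsPerSecond.entrySet()) {
            if (e.getValue() < slowRatio * median) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }
}
```

With a slowRatio of 0.5, for example, a task processing 120 records per
second would be flagged if its three siblings process about 1000 records per
second each. For each flagged task, the scheduler would then launch a copy on
another node and keep the output of whichever attempt finishes first.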


The detailed design doc will be ready soon.  Your reviews and comments will
be much appreciated.


Thanks!

Ryan
