Hello all, I am new to Spark and have been working on a small project that tries to tackle the straggler problem. I ran some SQL queries (GROUP BY) on a small cluster and observed that some tasks take several minutes while others finish in seconds.
I know that Spark already has a speculation mode, but I still see this problem with speculation turned on. Therefore, I modified the code to kill those stragglers instead of re-executing them, trading accuracy for speed. As expected, killing the stragglers causes the system to hang because of the lost tasks. Can anyone give some guidance on getting this to work? Is it possible to terminate some tasks early without affecting the overall execution of the job, at some cost in accuracy? I appreciate your help!

-- Jia Zhan
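
P.S. For context, the speculation mode I mention is controlled by the standard spark.speculation.* properties. Below is a minimal sketch of what enabling it looks like in spark-defaults.conf. The values are illustrative assumptions, not the exact settings from my cluster.

```properties
# Illustrative values only (assumed, not taken from my actual setup).

# Turn on speculative execution of slow tasks.
spark.speculation            true

# How often Spark checks for tasks to speculate.
spark.speculation.interval   100ms

# A task is a candidate for speculation when it runs this many times
# slower than the median task in its stage.
spark.speculation.multiplier 1.5

# Fraction of tasks in a stage that must finish before speculation
# is considered for that stage.
spark.speculation.quantile   0.75
```

With these settings Spark launches a second copy of a slow task rather than dropping it, so the job still waits for one of the two copies to finish. That is the behaviour I am trying to change.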