Hello, in a stage with many tasks, a few of the tasks show a large amount of shuffle spill.
I scouted the web to understand shuffle spill, but I did not find any simple explanation of the spill mechanism. What I put together is:

1. Shuffle spill can happen when the shuffle is written to disk (i.e. by the last "map" stage, as opposed to when the shuffle is read by the "reduce" stage).
2. It happens when a task has a lot to write to the shuffle: since that shuffle output needs to be sorted by key, the spilling mechanism allows Spark to sort data that does not fit in memory.

I am unclear, however, whether a large task will systematically lead to shuffle spill, or whether the number of keys (for the next reduce stage) that particular task encounters also has an impact.

Concretely, let's say I have:

```scala
val ab = RDD[(a, b)]
val ac = RDD[(a, c)]
val bd = RDD[(b, d)]
```

and I do:

```scala
val bc = ab.join(ac).values // we investigate this task, triggered by values
val cd = bc.join(bd).values
```

The task we investigate reads from a previous shuffle, and writes to another shuffle to prepare for the second join. I know that I have data skew on a key of "a", meaning a few tasks are expected to be large and I have stragglers.

Now, is that skew the cause of the shuffle spill, or is it because those straggler tasks also happen to contain a very large number of distinct "b"s?

Thanks
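To make the two quantities in the question concrete, here is a plain-Scala sketch (no Spark, all numbers made up) that separates them for the first join: the number of records the map task must buffer and sort before writing its shuffle output, versus the number of distinct "b" values in that output. A join on "a" emits one record per (left, right) pair, so a skewed "a" multiplies the record count, independently of how many distinct "b"s appear.

```scala
// Skewed inputs: key "hot" appears 1000 times on each side, "cold" once.
val ab: Seq[(String, Int)] =
  Seq.tabulate(1000)(i => ("hot", i % 10)) :+ (("cold", 1))
val ac: Seq[(String, String)] =
  Seq.tabulate(1000)(i => ("hot", s"c$i")) :+ (("cold", "c"))

// Records emitted by the join, per key "a": product of the two sides' counts.
val countAb = ab.groupBy(_._1).map { case (k, v) => k -> v.size }
val countAc = ac.groupBy(_._1).map { case (k, v) => k -> v.size }
val joinOutput: Map[String, Long] =
  countAb.map { case (k, n) => k -> n.toLong * countAc.getOrElse(k, 0) }

println(joinOutput) // hot -> 1000000 records, cold -> 1 record

// Distinct "b" values: this bounds how many reduce-side keys the next join
// sees, but not how many records this task writes.
val distinctB = ab.map(_._2).distinct.size
println(distinctB) // 10
```

The sketch suggests the two effects can be measured separately on real data, e.g. by sampling the skewed partitions and counting records versus distinct join keys.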