Hello,

I have a few tasks, in a stage with many tasks, that show a large amount of
shuffle spill.

I searched the web to understand shuffle spill, but did not find any
simple explanation of the spill mechanism. What I have pieced together is:

1. shuffle spill can happen when the shuffle data is written to disk (i.e.
by the last "map" stage, as opposed to when the shuffle is read by the
"reduce" stage)
2. it happens when a task has more shuffle data to write than fits in
memory; since that shuffle output needs to be sorted by key, the spilling
mechanism is what allows Spark to do that sort in pieces (see the
configuration sketch below)
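
To check my understanding of the knobs involved, here is a small sketch of
the settings I believe govern this on Spark 1.6+ (the setting names come
from the standard configuration reference; the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// My understanding: shuffle output is buffered in execution memory while it
// is being sorted; once a task can no longer grow its buffer, the sorted
// contents are spilled to local disk and merged later.
val conf = new SparkConf()
  .setAppName("shuffle-spill-settings")
  // fraction of the heap shared by execution (shuffles, joins, sorts) and storage
  .set("spark.memory.fraction", "0.6")
  // share of that fraction protected for cached blocks
  .set("spark.memory.storageFraction", "0.5")
  // spilled shuffle files are compressed (the default)
  .set("spark.shuffle.spill.compress", "true")
val sc = new SparkContext(conf)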

I am unclear, however, whether a large task will systematically lead to
shuffle spill, or whether the number of keys (for the next reduce stage)
that particular task encounters also has an impact.

Concretely:
Let's say I have:
val ab: RDD[(A, B)] = ??? // keyed by "a"
val ac: RDD[(A, C)] = ??? // keyed by "a"
val bd: RDD[(B, D)] = ??? // keyed by "b"

and I do:
val bc = ab.join(ac).values // the tasks we investigate run in this stage
val cd = bc.join(bd).values
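
To make that concrete, here is a self-contained toy version of the same
pipeline (the String/Int/Double types, the sample values and the local
master are made up purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object JoinPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("join-pipeline").setMaster("local[*]"))

    // toy data; "a1" is a hot key, to mimic the skew described below
    val ab = sc.parallelize(Seq(("a1", "b1"), ("a1", "b2"), ("a1", "b3"), ("a2", "b4")))
    val ac = sc.parallelize(Seq(("a1", 10), ("a2", 20)))
    val bd = sc.parallelize(Seq(("b1", 1.0), ("b2", 2.0), ("b4", 4.0)))

    // first join is keyed by "a"; .values re-keys the result by "b"
    val bc = ab.join(ac).values   // RDD[(String, Int)]: (b, c) pairs
    // second join is keyed by "b"
    val cd = bc.join(bd).values   // RDD[(Int, Double)]: (c, d) pairs

    cd.collect().foreach(println) // the action that actually runs both shuffles

    sc.stop()
  }
}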

The tasks we investigate read from a previous shuffle and write to another
shuffle to prepare for the second join. I know that I have data skew on an
"a" key, meaning a few tasks are expected to be large and to become
stragglers.

Now, is that skew by itself the cause of the shuffle spill, or is it
because those straggler tasks also happen to contain a very large number of
distinct "b"s?

Thanks
