I recently discovered that my jobs were taking several hours to complete
because of excessive shuffle spills. What I found was that once I hit the
point where there wasn't enough memory for the shuffles to hold all of
their file consolidations at once, they could spill so many times that the
job's runtime increased by orders of magnitude (and sometimes the job
failed altogether).
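
For reference, this is roughly how the spilling can be watched per task --
a minimal sketch using the standard SparkListener / TaskMetrics hooks; the
SpillLogger name is just mine, and the fields are the same numbers the web
UI reports as shuffle spill:

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log any task that spilled shuffle data out of memory.
class SpillLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && m.memoryBytesSpilled > 0) {
      println(s"task ${taskEnd.taskInfo.taskId} spilled " +
        s"${m.memoryBytesSpilled} bytes from memory, " +
        s"${m.diskBytesSpilled} bytes to disk")
    }
  }
}

// usage, given an existing SparkContext sc:
//   sc.addSparkListener(new SpillLogger)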

I've played with all the tuning parameters I can find. To speed the
shuffles up, I tried different values for the akka threads, and I also
tuned the shuffle buffering a bit (both up and down).
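
Concretely, the knobs I've been turning look something like this -- just a
sketch, and the values are examples rather than recommendations; these are
the older property names that go with the akka-based messaging, and newer
releases spell the buffer setting spark.shuffle.file.buffer instead:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  // more threads for the actor system that shuffle/block messages ride on
  .set("spark.akka.threads", "8")
  // per-partition shuffle write buffer, in KB (default 32)
  .set("spark.shuffle.file.buffer.kb", "64")

val sc = new SparkContext(conf)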

I feel like I'm seeing a weak point here. The mappers share memory space
with the reducers, and the shuffles need enough memory to consolidate and
pull; otherwise they spill, and spill, and spill. What I've noticed about
my jobs is that this is the difference between them taking 30 minutes and
taking 4 hours or more. Same job, just different memory tuning.

I've found that, because of the spilling, I'm better off not caching any
data in memory at all, lowering my storage fraction to 0, and hoping that
leaves my shuffles enough memory that my data doesn't spill continuously.
Is this the way it's supposed to work? It makes things hard, because it
feels like the memory limits get forced onto my job; otherwise it could
take orders of magnitude longer to execute.
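
In practice, "storage fraction to 0" for me means shifting the split
roughly like this (again a sketch using the legacy fraction properties;
the 0.6 on the shuffle side is only an illustration):

import org.apache.spark.{SparkConf, SparkContext}

// With nothing cached, the storage pool is dead weight, so hand it to the shuffles.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.storage.memoryFraction", "0.0")  // no cached RDDs, so no storage pool
  .set("spark.shuffle.memoryFraction", "0.6")  // let shuffle buffers grow (default 0.2)

val sc = new SparkContext(conf)
// ... same job as before; only the memory split changes ...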
