Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello,

I am running TeraSort (https://github.com/ehiggs/spark-terasort) on 100 GB
of data. The final Shuffle Spill metrics I am getting are:

Shuffle Spill(Memory): 122.5 GB
Shuffle Spill(Disk): 3.4 GB

What is the difference and relation between these two metrics? Does this
mean that 122.5 GB was spilled from memory during the shuffle?

Thank you,
Bijay


Re: Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Sandy Ryza
Hi Bijay,

Shuffle Spill (Disk) is the total number of bytes written to disk for
records spilled during the shuffle. Shuffle Spill (Memory) is the amount
of space those same records occupied in memory, as deserialized objects,
before they were spilled. The two differ because the serialized format is
more compact, and the on-disk version can be compressed as well. So in
your case, records that took up about 122.5 GB in memory were written out
as 3.4 GB on disk.
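
To make the size gap concrete, here is a small standalone Python sketch. It
is only an analogy and does not use Spark or reproduce how Spark estimates
memory: the record layout (10-byte key, 90-byte value), the helper names,
and the highly repetitive synthetic values are all assumptions chosen for
illustration. It compares a rough in-memory object footprint with a compact
serialized form and with that form after compression.

```python
import sys
import zlib


def make_records(n):
    """Build n synthetic (key, value) records: 10-byte key, 90-byte value."""
    return [(b"%010d" % i, b"x" * 90) for i in range(n)]


def in_memory_bytes(records):
    """Rough object-level footprint: each tuple plus its key and value objects."""
    return sum(
        sys.getsizeof(rec) + sys.getsizeof(rec[0]) + sys.getsizeof(rec[1])
        for rec in records
    )


def serialized_bytes(records):
    """Compact serialized form: keys and values concatenated back to back."""
    return b"".join(key + value for key, value in records)


records = make_records(10_000)
memory_size = in_memory_bytes(records)     # analogous to Shuffle Spill (Memory)
serialized = serialized_bytes(records)     # compact binary layout
compressed = zlib.compress(serialized)     # analogous to Shuffle Spill (Disk)

print("in-memory estimate:     ", memory_size)
print("serialized:             ", len(serialized))
print("serialized + compressed:", len(compressed))

# Ratio between the two metrics reported in this thread.
print("thread ratio (memory / disk): %.1f" % (122.5 / 3.4))
```

The in-memory figure is larger than the serialized one because every object
carries its own header and pointer overhead, and compression shrinks the
serialized bytes further. The actual ratio in a Spark job depends on the
serializer, the compression codec, and how compressible the data is; in
this thread it works out to roughly 36 to 1.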

-Sandy
