Spark currently doesn't allocate any memory off the heap for shuffle
objects. When the in-memory data gets too large, it writes it out to a
file and then merges the spilled files later.
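
The pattern is roughly the following (a minimal, illustrative spill-and-merge
sketch, not Spark's actual ExternalSorter; the class name and threshold are
made up):

import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Illustrative only: buffer records in memory, spill a sorted run to a
// temp file when the buffer fills up, and merge the runs at the end.
class SpillSketch(maxInMemory: Int = 10000) {
  private val buffer = ArrayBuffer[Long]()
  private val spills = ArrayBuffer[File]()

  def insert(record: Long): Unit = {
    buffer += record
    if (buffer.size >= maxInMemory) spill()
  }

  private def spill(): Unit = {
    val file = File.createTempFile("spill", ".txt")
    val out = new PrintWriter(file)
    try buffer.sorted.foreach(out.println) finally out.close()
    spills += file
    buffer.clear()
  }

  // A real implementation streams a k-way merge over the sorted runs;
  // sorting the concatenation keeps the sketch short.
  def merged(): Iterator[Long] = {
    val spilled = spills.iterator.flatMap(f => Source.fromFile(f).getLines().map(_.toLong))
    (spilled ++ buffer.iterator).toSeq.sorted.iterator
  }
}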
What exactly do you mean by store shuffle data in HDFS?
-Sandy
On Tue, Apr 14, 2015 at 10:15 AM, Kannan Ra
Sandy,
Can you clarify how it won't cause an OOM? Is it in any way related to memory
being allocated outside the heap, in native space? The reason I ask is that I
have a use case to store shuffle data in HDFS. Since there is no notion of
memory-mapped files in HDFS, I need to store it as a byte buffer. I want to
Hi Kannan,
In both MapReduce and Spark, the amount of shuffle data a task produces can
exceed the task's memory without risk of OOM.
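
For instance, something like the following should complete even though the
shuffle output is far larger than the executor heap, because the shuffle
spills to local disk (the sizes are just illustrative; I haven't run this
exact snippet):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-spill-demo")
  .set("spark.executor.memory", "512m")   // deliberately small heap
val sc = new SparkContext(conf)

// Roughly a couple of GB of shuffle data from a 512 MB executor:
// the groupByKey shuffle spills to disk instead of OOMing.
sc.parallelize(0L until 100000000L, 200)
  .map(i => (i % 10000, i))
  .groupByKey()
  .count()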
-Sandy
On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid wrote:
> That limit doesn't have anything to do with the amount of available
> memory. It's just a tuning par
That limit doesn't have anything to do with the amount of available
memory. It's just a tuning parameter: one version is more efficient for
smaller files, while the other is better for bigger files. I suppose the
comment in FileSegmentManagedBuffer explains it a little better:
https://github.com/apache/spark
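
For context, the decision that parameter controls looks roughly like this (a
simplified sketch, not the actual FileSegmentManagedBuffer code; the threshold
corresponds to spark.storage.memoryMapThreshold):

import java.io.{File, IOException, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Small segments: copy into a regular heap ByteBuffer.
// Large segments: memory-map the region instead of copying it.
def readSegment(file: File, offset: Long, length: Long,
                memoryMapThreshold: Long = 2L * 1024 * 1024): ByteBuffer = {
  val channel = new RandomAccessFile(file, "r").getChannel
  try {
    if (length < memoryMapThreshold) {
      val buf = ByteBuffer.allocate(length.toInt)
      channel.position(offset)
      while (buf.remaining() > 0) {
        if (channel.read(buf) == -1) {
          throw new IOException("Reached EOF before reading the full segment")
        }
      }
      buf.flip()
      buf
    } else {
      // The mapping stays valid after the channel is closed.
      channel.map(MapMode.READ_ONLY, offset, length)
    }
  } finally {
    channel.close()
  }
}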