Re: Using memory mapped file for shuffle

2015-04-29 Thread Sandy Ryza
Spark currently doesn't allocate any memory off of the heap for shuffle objects. When the in-memory data gets too large, it will write it out to a file and then merge the spilled files later. What exactly do you mean by "store shuffle data in HDFS"? -Sandy On Tue, Apr 14, 2015 at 10:15 AM, Kannan Ra
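
The spill-and-merge behavior Sandy describes can be sketched as a plain external sort. This is illustrative only: the threshold, record type, and file format are made up, and Spark's actual logic lives in classes such as ExternalSorter and ExternalAppendOnlyMap.

    import java.io.{File, PrintWriter}
    import scala.collection.mutable
    import scala.io.Source

    // Illustrative spill-and-merge pattern, not Spark's real shuffle code.
    object SpillSketch {
      val SpillThreshold = 100000                      // made-up cap on records held in memory
      private val buffer = mutable.ArrayBuffer.empty[(String, Int)]
      private val spills = mutable.ArrayBuffer.empty[File]

      def insert(record: (String, Int)): Unit = {
        buffer += record
        if (buffer.size >= SpillThreshold) spill()     // bounded memory: flush a sorted run to disk
      }

      private def spill(): Unit = {
        val file = File.createTempFile("spill-", ".tsv")
        val out = new PrintWriter(file)
        try buffer.sortBy(_._1).foreach { case (k, v) => out.println(s"$k\t$v") }
        finally out.close()
        buffer.clear()                                 // heap is freed; the run now lives on disk
        spills += file
      }

      // k-way merge of the sorted on-disk runs plus whatever is still in memory.
      def sortedIterator: Iterator[(String, Int)] = {
        val runs: Seq[BufferedIterator[(String, Int)]] =
          spills.toSeq.map { f =>
            Source.fromFile(f).getLines().map { line =>
              val Array(k, v) = line.split('\t'); (k, v.toInt)
            }.buffered
          } :+ buffer.sortBy(_._1).iterator.buffered

        // Min-heap over the heads of all runs: always emit the smallest next record.
        val heap = mutable.PriorityQueue.empty[BufferedIterator[(String, Int)]](
          Ordering.by[BufferedIterator[(String, Int)], String](_.head._1).reverse)
        heap ++= runs.filter(_.hasNext)

        new Iterator[(String, Int)] {
          def hasNext: Boolean = heap.nonEmpty
          def next(): (String, Int) = {
            val it = heap.dequeue()
            val rec = it.next()
            if (it.hasNext) heap.enqueue(it)
            rec
          }
        }
      }
    }

The point is that memory use is bounded by the spill threshold, not by the total amount of shuffle data a task produces.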

Re: Using memory mapped file for shuffle

2015-04-14 Thread Kannan Rajah
Sandy, Can you clarify how it won't cause OOM? Is it in any way related to memory being allocated outside the heap, in native space? The reason I ask is that I have a use case to store shuffle data in HDFS. Since HDFS has no notion of memory mapped files, I need to store it as a byte buffer. I want to
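
For context, the distinction Kannan is drawing can be sketched as follows. Paths, offsets, and lengths are illustrative: a local file segment can be exposed as an off-heap memory mapping, while HDFS offers only stream reads into heap-allocated byte arrays.

    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // A local shuffle file can be memory mapped: the mapping lives outside
    // the JVM heap, so a large segment does not consume heap space.
    def mapLocalSegment(path: String, offset: Long, length: Long): ByteBuffer = {
      val channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
      try channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
      finally channel.close()
    }

    // HDFS exposes only stream reads, so the bytes must be copied into a
    // buffer allocated on the JVM heap, the size of the whole segment.
    def readHdfsSegment(path: String, offset: Long, length: Int): ByteBuffer = {
      val fs = FileSystem.get(new Configuration())
      val in = fs.open(new Path(path))
      try {
        val buf = new Array[Byte](length)   // heap allocation proportional to segment size
        in.readFully(offset, buf)
        ByteBuffer.wrap(buf)
      } finally in.close()
    }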

Re: Using memory mapped file for shuffle

2015-04-14 Thread Sandy Ryza
Hi Kannan, Both in MapReduce and Spark, the amount of shuffle data a task produces can exceed the task's memory without risk of OOM. -Sandy On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid wrote: > That limit doesn't have anything to do with the amount of available > memory. It's just a tuning par
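
A note on the knobs behind this guarantee: in the Spark 1.x line current at the time of this thread, spill behavior is controlled by settings along these lines (the values shown are believed to be the defaults of that era).

    import org.apache.spark.SparkConf

    // Spark 1.x era shuffle-spill settings; values shown are the defaults.
    val conf = new SparkConf()
      .set("spark.shuffle.spill", "true")           // spill to disk rather than fail with OOM
      .set("spark.shuffle.memoryFraction", "0.2")   // fraction of the heap shuffle buffers may use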

Re: Using memory mapped file for shuffle

2015-04-14 Thread Imran Rashid
That limit doesn't have anything to do with the amount of available memory. It's just a tuning parameter: one code path is more efficient for smaller files, while the other is better for bigger files. I suppose the comment is a little better in FileSegmentManagedBuffer: https://github.com/apache/spark
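
The two code paths the parameter chooses between look roughly like the sketch below. This is a simplification of FileSegmentManagedBuffer.nioByteBuffer, not the verbatim Spark source; the constant stands in for the spark.storage.memoryMapThreshold setting (2 MB by default in this era).

    import java.io.{File, IOException, RandomAccessFile}
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel

    // Stand-in for spark.storage.memoryMapThreshold (2 MB default).
    val MemoryMapThreshold: Long = 2 * 1024 * 1024

    // Simplified version of the logic in FileSegmentManagedBuffer.nioByteBuffer.
    def readSegment(file: File, offset: Long, length: Long): ByteBuffer = {
      val channel = new RandomAccessFile(file, "r").getChannel
      try {
        if (length < MemoryMapThreshold) {
          // Small segment: a plain read avoids the fixed cost of setting up
          // (and eventually unmapping) a memory mapping.
          val buf = ByteBuffer.allocate(length.toInt)
          channel.position(offset)
          while (buf.remaining() != 0) {
            if (channel.read(buf) == -1)
              throw new IOException("unexpected end of file")
          }
          buf.flip()
          buf
        } else {
          // Large segment: mapping avoids copying all the bytes onto the heap.
          channel.map(FileChannel.MapMode.READ_ONLY, offset, length)
        }
      } finally channel.close()
    }

Below the threshold, the fixed setup cost of a mapping outweighs a small copy; above it, mapping avoids pulling megabytes of data onto the heap.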