So I'm doing some more reading of the ParallelCollectionRDD code.

It stores the data as a plain old list. Does this also mean that the data 
backing a ParallelCollectionRDD is always stored on the driver's heap, as 
opposed to being distributed to the cluster? Can this behavior be overridden 
without explicitly saving the RDD to disk?
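For context, here is a minimal sketch of the pattern under discussion (the setup names are illustrative, assuming a local master and Spark on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal setup for illustration.
val conf = new SparkConf().setMaster("local[8]").setAppName("parallelize-demo")
val sc = new SparkContext(conf)

// parallelize() wraps the driver-side Seq in a ParallelCollectionRDD.
// The RDD keeps a reference to the collection and slices it into
// partitions lazily, so the backing data stays on the driver's heap
// for as long as the RDD (or the collection itself) is reachable.
val data: Seq[Int] = 1 to 1000000
val rdd = sc.parallelize(data, numSlices = 8)
```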

-Matt Cheah

From: Andrew Winings <[email protected]>
Date: Friday, November 1, 2013 3:51 PM
To: "[email protected]" <[email protected]>
Cc: Mingyu Kim <[email protected]>
Subject: Out of memory building RDD on local[N]

Hi everyone,

I'm trying to run a task where I accumulate a ~1.5GB RDD with the Spark master 
URL set to local[8]. No matter what I do with this RDD, whether persisting it 
with StorageLevel.DISK_ONLY or un-persisting it altogether, the JVM always 
runs out of heap space.

I'm building the RDD in "batches": I build up a Java collection of 500,000 
items, then call context.parallelize() on that collection; call the result 
currBatchRDD. I then perform an RDD.union of the previous batch's RDD 
(prevBatchRDD) and the parallelized collection, set prevBatchRDD to the union 
result, clear the Java collection, and continue from there.
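The loop described above can be sketched roughly as follows (a Scala approximation of the reported Java code; the function and variable names are assumptions, and the batch is copied before parallelizing because ParallelCollectionRDD slices its backing Seq lazily, so clearing a live buffer could corrupt the RDD):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative sketch of the batch-and-union pattern; assumes an
// existing SparkContext `sc` and an iterator `source` over the input.
def buildInBatches(sc: SparkContext,
                   source: Iterator[String],
                   batchSize: Int = 500000): RDD[String] = {
  var prevBatchRDD: RDD[String] = sc.emptyRDD[String]
  val buffer = new ArrayBuffer[String](batchSize)
  for (record <- source) {
    buffer += record
    if (buffer.size == batchSize) {
      // Copy the batch so clearing the buffer below is safe.
      val currBatchRDD = sc.parallelize(buffer.toVector)
      prevBatchRDD = prevBatchRDD
        .union(currBatchRDD)
        .persist(StorageLevel.DISK_ONLY)
      buffer.clear()
    }
  }
  if (buffer.nonEmpty)
    prevBatchRDD = prevBatchRDD.union(sc.parallelize(buffer.toVector))
  prevBatchRDD
}
```

Note that persist() only marks the RDD for storage; nothing is written to disk until an action forces computation, and each parallelized batch still originates from a driver-side collection.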

I would expect that, both locally and on an actual Spark cluster, 
StorageLevel configurations would be respected for keeping RDDs on-heap or 
off-heap. However, my memory profile shows that the entire RDD is being 
collected on-heap in the local case. Am I misunderstanding the documentation?

Thanks,

-Matt Cheah
