Hi everyone,

I'm trying to run a task that accumulates a ~1.5GB RDD, with the Spark master 
URL set to local[8]. No matter what I do with this RDD, whether persisting it 
with StorageLevel.DISK_ONLY or un-persisting it altogether, it always causes 
the JVM to run out of heap space.

I'm building the RDD in "batches": I build up a Java collection of 500,000 
items, then call context.parallelize() on that collection; call the result 
currBatchRDD. Then I perform an RDD.union on the previous batch's RDD 
(prevBatchRDD) and currBatchRDD, set prevBatchRDD to the union result, clear 
the Java collection, and continue from there.
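In case it helps to make the pattern concrete, here is a minimal sketch of the loop I described (the item type, the data being generated, and the method names here are placeholders I made up for illustration, not my actual code):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class BatchUnionSketch {
    static final int BATCH_SIZE = 500_000;

    static JavaRDD<String> accumulate(JavaSparkContext sc, int numBatches) {
        JavaRDD<String> prevBatchRDD = sc.emptyRDD();
        List<String> buffer = new ArrayList<>(BATCH_SIZE);
        for (int b = 0; b < numBatches; b++) {
            for (int i = 0; i < BATCH_SIZE; i++) {
                buffer.add("item-" + b + "-" + i); // stand-in for real data
            }
            // Copy the buffer so clearing it below doesn't mutate the
            // list backing the (lazily sliced) parallelized RDD.
            JavaRDD<String> currBatchRDD = sc.parallelize(new ArrayList<>(buffer));
            prevBatchRDD = prevBatchRDD.union(currBatchRDD)
                                       .persist(StorageLevel.DISK_ONLY());
            buffer.clear(); // reuse the Java collection for the next batch
        }
        return prevBatchRDD;
    }
}
```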

I would expect that, both locally and on an actual Spark cluster, the 
StorageLevel configuration would be respected for keeping RDDs on-heap or 
off-heap. However, my memory profile shows that in the local case the entire 
RDD is being accumulated on-heap. Am I misunderstanding the documentation?

Thanks,

-Matt Cheah
