Hi everyone, I'm trying to run a task where I accumulate a ~1.5GB RDD with the Spark master URL set to local[8]. No matter what I do with this RDD, whether persisting it with StorageLevel.DISK_ONLY or un-persisting it altogether, the JVM always runs out of heap space.
I'm building the RDD in "batches". I build up a Java collection of 500,000 items, then call context.parallelize() on that collection; call the result currBatchRDD. I then take the union of the previous batch's RDD (prevBatchRDD) and currBatchRDD, assign that union back to prevBatchRDD, clear the Java collection, and continue from there.

I would expect that, both locally and on an actual Spark cluster, the StorageLevel configuration would be respected for keeping RDD data on-heap or off-heap. However, my memory profile shows that in the local case the entire RDD is being accumulated on-heap. Am I misunderstanding the documentation? Thanks, -Matt Cheah
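For reference, a minimal sketch of the batching loop described above, using the Spark Java API. The class name, the record type (String), and the loadRecords() helper are placeholders I made up for illustration; the actual job differs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class BatchUnionSketch {
    public static void main(String[] args) {
        // local[8]: driver and executors share a single JVM heap
        JavaSparkContext sc = new JavaSparkContext("local[8]", "batch-union");

        JavaRDD<String> prevBatchRDD = sc.emptyRDD();
        List<String> buffer = new ArrayList<>();

        for (String record : loadRecords()) {   // loadRecords() is a placeholder
            buffer.add(record);
            if (buffer.size() == 500_000) {
                // Parallelize the current batch, union it onto the accumulator,
                // and ask for disk-only storage
                JavaRDD<String> currBatchRDD = sc.parallelize(buffer);
                prevBatchRDD = prevBatchRDD.union(currBatchRDD)
                                           .persist(StorageLevel.DISK_ONLY());
                buffer = new ArrayList<>();     // clear the Java collection
            }
        }
        sc.stop();
    }

    private static List<String> loadRecords() {
        return Collections.emptyList();         // stand-in for the real data source
    }
}
```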
