This is a long shot, but... I'm trying to load a bunch of files spread out over HDFS into an RDD. In most cases it works well, but for a few very large files I exceed available memory. My current workflow basically works like this:
    context.parallelize(fileNames).flatMap { file =>
      transform file into a bunch of records
    }

I'm wondering if there is any API to somehow "flush" the records of a big dataset so I don't have to hold them all in memory at once. I know this doesn't exist, but conceptually:

    context.parallelize(fileNames).streamMap { (file, stream) =>
      for every 10K records, write them to the stream and flush
    }
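The closest I've come up with on my own is returning a lazy Iterator from flatMap, since flatMap accepts any TraversableOnce. Here's a rough sketch of what I mean, reusing context and fileNames from above (openLines and parseRecord are hypothetical stand-ins for my real reader and transform, and I haven't verified this actually keeps memory bounded, which is why I'm asking):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // Hypothetical helper: open an HDFS file and expose its lines as a
    // lazy Iterator (stream cleanup omitted for brevity).
    def openLines(file: String): Iterator[String] = {
      val fs = FileSystem.get(new Configuration())
      Source.fromInputStream(fs.open(new Path(file))).getLines()
    }

    // Placeholder for my real record transformation.
    def parseRecord(line: String): String = line

    // Because the function passed to flatMap returns an Iterator, records
    // can be pulled one at a time instead of being materialized as a
    // whole collection per file.
    val records = context.parallelize(fileNames).flatMap { file =>
      openLines(file).map(parseRecord)
    }

Is something like this on the right track, or is there a better API for it?

Keith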