Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a simpler (and perhaps more efficient) approach.

Keith
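A minimal sketch of that workaround (Andy's suggestion in the quoted thread below: stream each file's records out to a staging directory on HDFS, then read the staged output back as one RDD) might look roughly like the following. The readRecords helper, the staging directory, and the plain-text staging format are assumptions for illustration, not details from the thread.

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object StagingSketch {
      // Hypothetical reader for the binary format: yields one record at a time,
      // so a whole file never has to be materialized in memory.
      def readRecords(path: String): Iterator[String] = ???

      def viaHdfsStaging(sc: SparkContext, fileNames: Seq[String], stagingDir: String): RDD[String] = {
        // Step 1: each task streams one file record by record into a staged file on HDFS,
        // so memory stays bounded no matter how large the file is.
        sc.parallelize(fileNames).foreach { file =>
          val fs  = FileSystem.get(new Configuration())
          val out = new PrintWriter(fs.create(new Path(stagingDir, new Path(file).getName + ".records")))
          try readRecords(file).foreach(r => out.println(r))
          finally out.close()
        }
        // Step 2: reading the staging directory back gives the union of all per-file outputs.
        sc.textFile(stagingDir)
      }
    }

The point of the sketch is that each executor task only ever holds one record (plus output buffering) in memory at a time, and textFile() on the staging directory is effectively the union of the per-file outputs.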
On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.tw...@gmail.com> wrote:

> Could you modify your function so that it streams through the files record
> by record and outputs them to HDFS, then read them all in as RDDs and take
> the union? That would only use bounded memory.
>
> On 1 December 2014 at 17:19, Keith Simmons <ke...@pulse.io> wrote:
>
>> Actually, I'm working with a binary format. The API allows reading out a
>> single record at a time, but I'm not sure how to get those records into
>> Spark (without reading everything into memory from a single file at once).
>>
>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
>>
>>> file => transform file into a bunch of records
>>>
>>> What does this function do exactly? Does it load the file locally?
>>> Spark supports RDDs exceeding global RAM (cf. the terasort example), but
>>> if your example just loads each file locally, then this may cause problems.
>>> Instead, you should load each file into an RDD with context.textFile(),
>>> flatMap that, and union these RDDs.
>>>
>>> Also see
>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>>
>>> On 1 December 2014 at 16:50, Keith Simmons <ke...@pulse.io> wrote:
>>>
>>>> This is a long shot, but...
>>>>
>>>> I'm trying to load a bunch of files spread out over HDFS into an RDD,
>>>> and in most cases it works well, but for a few very large files, I exceed
>>>> available memory. My current workflow basically works like this:
>>>>
>>>> context.parallelize(fileNames).flatMap { file =>
>>>>   transform file into a bunch of records
>>>> }
>>>>
>>>> I'm wondering if there are any APIs to somehow "flush" the records of a
>>>> big dataset so I don't have to load them all into memory at once. I know
>>>> this doesn't exist, but conceptually:
>>>>
>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>>>   for every 10K records, write the records to the stream and flush
>>>> }
>>>>
>>>> Keith
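For completeness, the context.textFile()/union approach suggested mid-thread might be sketched like this. It assumes line-oriented input that textFile() can read (which is why it doesn't directly fit the binary format discussed above), and parseLine is a hypothetical per-line transform, not something from the thread.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object TextFileUnionSketch {
      // Hypothetical transform from one input line to zero or more records.
      def parseLine(line: String): Seq[String] = ???

      def viaTextFileUnion(sc: SparkContext, fileNames: Seq[String]): RDD[String] = {
        // One lazily evaluated RDD per file; nothing is loaded on the driver here.
        val perFile = fileNames.map(f => sc.textFile(f).flatMap(parseLine))
        // union just concatenates the per-file RDDs without materializing them.
        sc.union(perFile)
      }
    }

Because textFile() and flatMap are lazy, each file is read partition by partition on the executors rather than being pulled into memory up front.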