Actually, I'm working with a binary format. The API lets me read out one record at a time, but I'm not sure how to get those records into Spark without reading everything from a single file into memory at once.
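For concreteness, the reader API has roughly this shape (the names and the length-prefixed framing below are made up; it's just a sketch of the kind of interface I'm dealing with, not the real library):

import java.io.{Closeable, DataInputStream, FileInputStream}

// Hypothetical sketch of the record-at-a-time reader I'm describing.
// Open a file, then pull one decoded record at a time until it's exhausted.
class BinaryRecordReader(path: String) extends Closeable {
  private val in = new DataInputStream(new FileInputStream(path))

  // Returns the next record, or None at end of file.
  // (available() is only a rough end-of-file check, fine for a sketch.)
  def nextRecord(): Option[Array[Byte]] = {
    if (in.available() == 0) None
    else {
      val len = in.readInt()            // assumed length-prefixed framing
      val buf = new Array[Byte](len)
      in.readFully(buf)
      Some(buf)
    }
  }

  override def close(): Unit = in.close()
}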
On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.tw...@gmail.com> wrote:

> file => transform file into a bunch of records
>
> What does this function do exactly? Does it load the file locally?
> Spark supports RDDs exceeding global RAM (cf. the terasort example), but if
> your example just loads each file locally, then this may cause problems.
> Instead, you should load each file into an RDD with context.textFile(),
> flatMap that and union these RDDs.
>
> Also see
> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>
> On 1 December 2014 at 16:50, Keith Simmons <ke...@pulse.io> wrote:
>
>> This is a long shot, but...
>>
>> I'm trying to load a bunch of files spread out over HDFS into an RDD, and
>> in most cases it works well, but for a few very large files, I exceed
>> available memory. My current workflow basically works like this:
>>
>> context.parallelize(fileNames).flatMap { file =>
>>   transform file into a bunch of records
>> }
>>
>> I'm wondering if there are any APIs to somehow "flush" the records of a
>> big dataset so I don't have to load them all into memory at once. I know
>> this doesn't exist, but conceptually:
>>
>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>   for every 10K records, write records to stream and flush
>> }
>>
>> Keith
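P.S. If I'm reading the suggestion right, the per-file textFile + flatMap + union approach would look something like the sketch below (parse is just a placeholder for whatever turns a line into records). It assumes line-oriented text input, though, which is the part I get stuck on with my binary format.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch of the suggested approach: build one RDD per file with textFile(),
// flatMap each line into records, then union the per-file RDDs.
def loadAll[T: ClassTag](sc: SparkContext,
                         fileNames: Seq[String],
                         parse: String => TraversableOnce[T]): RDD[T] = {
  val perFile: Seq[RDD[T]] = fileNames.map(path => sc.textFile(path).flatMap(parse))
  sc.union(perFile)
}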