Hi Philip, you can use org.apache.hadoop.streaming.StreamInputFormat, which should fit your case. Just set stream.recordreader.begin and stream.recordreader.end, and the record reader will return each block of text between the BEGIN and END markers as a single record.
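
Untested sketch below, assuming your records are wrapped in <record>...</record> markers (substitute your own delimiters), that sc is your SparkContext, and that the hadoop-streaming jar is on the classpath; the input path is a placeholder. Note StreamInputFormat is an old-API (mapred) format, so it goes through hadoopRDD rather than newAPIHadoopFile, and you also need to point stream.recordreader.class at the stock StreamXmlRecordReader, which is the reader that honors those begin/end properties.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

val jobConf = new JobConf(sc.hadoopConfiguration)
// StreamInputFormat delegates to the reader named here; the stock
// StreamXmlRecordReader is the one that uses begin/end markers.
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<record>")
jobConf.set("stream.recordreader.end", "</record>")
FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input") // placeholder path

// Old mapred API, hence hadoopRDD instead of newAPIHadoopFile.
val blocks = sc.hadoopRDD(jobConf,
  classOf[StreamInputFormat], classOf[Text], classOf[Text])

// With StreamXmlRecordReader the whole record arrives as the key,
// if memory serves; check both key and value on your data.
val records = blocks.map { case (k, v) => k.toString }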
On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen <[email protected]> wrote:
> Philip, if there are easily detectable line groups you might define your
> own InputFormat. Alternatively, you can consider using mapPartitions() to
> get access to the entire data partition instead of a row at a time. You'd
> still have to worry about what happens at the partition boundaries. A
> third approach is indeed to pre-process with an appropriate mapper/reducer.
>
> Sent while mobile. Pls excuse typos etc.
>
> I have a file that consists of multi-line records. Is it possible to read
> in multi-line records with a method such as SparkContext.newAPIHadoopFile?
> Or do I need to pre-process the data so that all the data for one element
> is in a single line?
>
> Thanks,
> Philip
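
For the mapPartitions() route Christopher mentions, here's a rough sketch. It assumes each record begins at a line starting with "ID:" (purely illustrative; swap in whatever marks your record boundaries) and that no record crosses a partition boundary. If records can straddle partitions you'd still have to stitch together the edge groups of neighboring partitions, or fall back to one of the other approaches.

import scala.collection.mutable.ListBuffer

val lines = sc.textFile("hdfs:///path/to/input") // placeholder path

val records = lines.mapPartitions { iter =>
  val out = ListBuffer[List[String]]()
  val buf = ListBuffer[String]()
  for (line <- iter) {
    // A header line closes the previous record and starts a new one.
    if (line.startsWith("ID:") && buf.nonEmpty) {
      out += buf.toList
      buf.clear()
    }
    buf += line
  }
  if (buf.nonEmpty) out += buf.toList // flush the last record
  out.iterator
}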
