Thank you for pointing me in the right direction!
On 12/24/2013 2:39 PM, suman bharadwaj wrote:
Just one correction: I think NLineInputFormat won't fit your use case.
You may have to write a custom record reader, use TextInputFormat, and
plug it into Spark as shown above.
Regards,
Suman
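[Editor's sketch] The core of the custom record reader Suman describes is just logic that folds consecutive lines into one record. Here is a minimal, hedged illustration in plain Python (not the Hadoop RecordReader API): the `starts_record` predicate and the indentation rule in the example are hypothetical stand-ins for whatever actually delimits your records.

```python
def group_records(lines, starts_record):
    """Group a stream of lines into multi-line records.

    A new record begins whenever starts_record(line) is true;
    all other lines are treated as continuations of the current
    record. This mirrors what a custom RecordReader would do.
    """
    record = []
    for line in lines:
        if starts_record(line) and record:
            yield "\n".join(record)   # emit the finished record
            record = []
        record.append(line)
    if record:                        # flush the final record
        yield "\n".join(record)

# Hypothetical rule: any non-indented line starts a new record.
lines = ["event A", "  detail 1", "  detail 2", "event B", "  detail 1"]
records = list(group_records(lines, lambda l: not l.startswith(" ")))
# records == ["event A\n  detail 1\n  detail 2", "event B\n  detail 1"]
```

A real RecordReader would implement the same loop against the underlying input split rather than an in-memory list.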
Philip, if there are easily detectable line groups you might define your
own InputFormat. Alternatively, you can use mapPartitions() to get
access to an entire data partition instead of one row at a time; you'd
still have to worry about what happens at the partition boundaries. A third
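[Editor's sketch] The mapPartitions() route might look like the following, with plain Python standing in for the function you would pass as `rdd.mapPartitions(group_partition)`. The BEGIN/END markers are illustrative, and the boundary handling is deliberately naive to show the caveat raised above.

```python
def group_partition(iterator):
    """Candidate function for rdd.mapPartitions(group_partition).

    Receives every line of one partition and yields joined
    multi-line records. NOTE: a record whose lines straddle a
    partition boundary is silently dropped or split -- exactly
    the boundary problem mentioned above.
    """
    record = []
    for line in iterator:
        if line.startswith("BEGIN"):      # illustrative start marker
            record = [line]
        elif line.startswith("END"):      # illustrative end marker
            record.append(line)
            yield "\n".join(record)
            record = []
        elif record:                      # continuation line
            record.append(line)

# Simulating the contents of a single partition:
partition = ["BEGIN", "payload 1", "END", "BEGIN", "payload 2", "END"]
print(list(group_partition(partition)))
# -> ['BEGIN\npayload 1\nEND', 'BEGIN\npayload 2\nEND']
```

Handling records that cross partition boundaries would need extra work, e.g. shipping the dangling head/tail fragments of each partition to a follow-up step and stitching them there.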
Hi Philip,
you can use org.apache.hadoop.streaming.StreamInputFormat, which should
fit your use case. Just specify stream.recordreader.begin
and stream.recordreader.end, and the record reader will return the block
of records between BEGIN and END each time.
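[Editor's sketch] In a modern PySpark (an API that postdates this thread), wiring up that suggestion might look roughly like this. The `stream.recordreader.class` value is an assumption: StreamXmlRecordReader is the reader in hadoop-streaming that honours the begin/end properties, and the BEGIN/END markers are placeholders for your real delimiters.

```python
# Properties consumed by hadoop-streaming's StreamInputFormat.
# The marker strings below are placeholders, not real delimiters.
STREAM_CONF = {
    "stream.recordreader.class":
        "org.apache.hadoop.streaming.StreamXmlRecordReader",
    "stream.recordreader.begin": "BEGIN",
    "stream.recordreader.end": "END",
}

def multiline_rdd(sc, path):
    """Sketch: hand StreamInputFormat to PySpark's hadoopFile().

    Returns an RDD of (Text, Text) pairs where each value is one
    BEGIN..END block. Untested against a real cluster.
    """
    return sc.hadoopFile(
        path,
        "org.apache.hadoop.streaming.StreamInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.Text",
        conf=STREAM_CONF,
    )
```

In Scala the equivalent would be SparkContext.hadoopFile with a JobConf carrying the same three properties.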
On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen