Re: multi-line elements

2013-12-24 Thread Philip Ogren
Thank you for pointing me in the right direction!

On 12/24/2013 2:39 PM, suman bharadwaj wrote:
> Just one correction: I think NLineInputFormat won't fit your use case. I think you may have to write a custom record reader, use TextInputFormat, and plug it into Spark as shown above.
>
> Regards,
> Suman

Re: multi-line elements

2013-12-24 Thread Christopher Nguyen
Philip, if there are easily detectable line groups, you might define your own InputFormat. Alternatively, you can consider using mapPartitions() to get access to the entire data partition instead of one row at a time. You'd still have to worry about what happens at the partition boundaries. A third
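
A minimal sketch of the mapPartitions() idea, written as plain Python so it runs without a cluster. It assumes each record starts with a recognizable line; the function name group_records and the "BEGIN" marker are hypothetical and only for illustration. It does nothing clever at partition boundaries: lines that precede the first start line in a partition come out as a fragment that would still need to be stitched to the previous partition's last record.

```python
from typing import Callable, Iterable, Iterator, List


def group_records(lines: Iterable[str],
                  is_start: Callable[[str], bool]) -> Iterator[List[str]]:
    """Group consecutive lines into records.

    A new record begins at every line for which is_start(line) is true.
    Lines seen before the first start line are emitted as a leading
    fragment (this is the partition-boundary case).
    """
    current: List[str] = []
    for line in lines:
        if is_start(line) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def starts_record(line: str) -> bool:
    # Hypothetical marker; replace with whatever identifies a record start.
    return line.startswith("BEGIN")


# One partition's worth of lines, the first of which belongs to a record
# that started in the previous partition.
partition = ["tail of previous record", "BEGIN a", "x", "BEGIN b", "y"]
grouped = list(group_records(partition, starts_record))

# With PySpark this would be applied per partition, roughly:
#   records = rdd.mapPartitions(lambda it: group_records(it, starts_record))
```

Here grouped holds three items: the leading fragment, then the two complete records.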

Re: multi-line elements

2013-12-24 Thread Azuryy Yu
Hi Philip, you can specify org.apache.hadoop.streaming.StreamInputFormat, which should fit your use case. You just set stream.recordreader.begin and stream.recordreader.end, and the reader then returns the block of records between BEGIN and END each time. On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen
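
To illustrate what begin/end-delimited reading means, here is a small plain-Python sketch. It is not the Hadoop reader and does not reproduce its exact behavior; the function name records_between is hypothetical, and keeping the markers inside each returned record, as well as dropping an unterminated final record, are assumptions of this sketch.

```python
from typing import Iterator


def records_between(text: str, begin: str, end: str) -> Iterator[str]:
    """Yield each span of text that starts with `begin` and stops at the
    next `end`, markers included. Text outside such spans is skipped."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            return
        stop = text.find(end, start + len(begin))
        if stop == -1:
            return  # unterminated record: dropped in this sketch
        stop += len(end)
        yield text[start:stop]
        pos = stop


sample = "junk BEGIN a\nb END more BEGIN c END"
blocks = list(records_between(sample, "BEGIN", "END"))
```

Each element of blocks is one multi-line record, so downstream code sees whole records rather than individual lines.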