Hi Philip, you can use org.apache.hadoop.streaming.StreamInputFormat, which should fit your case. Just set stream.recordreader.begin and stream.recordreader.end, and the record reader will return each block of text between the BEGIN and END markers as a single record.
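
Untested sketch below, assuming your records are wrapped in <record>...</record> markers (substitute your own delimiters), that sc is your SparkContext, and that the hadoop-streaming jar is on the classpath; the input path is a placeholder. Note StreamInputFormat is an old-API (mapred) format, so it goes through hadoopRDD rather than newAPIHadoopFile, and you also need to point stream.recordreader.class at the stock StreamXmlRecordReader, which is the reader that honors those begin/end properties.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

val jobConf = new JobConf(sc.hadoopConfiguration)
// StreamInputFormat delegates to the reader named here; the stock
// StreamXmlRecordReader is the one that uses begin/end markers.
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<record>")
jobConf.set("stream.recordreader.end", "</record>")
FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input") // placeholder path

// Old mapred API, hence hadoopRDD instead of newAPIHadoopFile.
val blocks = sc.hadoopRDD(jobConf,
  classOf[StreamInputFormat], classOf[Text], classOf[Text])

// With StreamXmlRecordReader the whole record arrives as the key,
// if memory serves; check both key and value on your data.
val records = blocks.map { case (k, v) => k.toString }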
On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen <[email protected]> wrote:
> Philip, if there are easily detectable line groups you might define your
> own InputFormat. Alternatively, you can consider using mapPartitions() to
> get access to the entire data partition instead of a row at a time. You'd
> still have to worry about what happens at the partition boundaries. A
> third approach is indeed to pre-process with an appropriate mapper/reducer.
>
> Sent while mobile. Pls excuse typos etc.
>
> I have a file that consists of multi-line records. Is it possible to read
> in multi-line records with a method such as SparkContext.newAPIHadoopFile?
> Or do I need to pre-process the data so that all the data for one element
> is in a single line?
>
> Thanks,
> Philip
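
For the mapPartitions() route Christopher mentions, here's a rough sketch. It assumes each record begins at a line starting with "ID:" (purely illustrative; swap in whatever marks your record boundaries) and that no record crosses a partition boundary. If records can straddle partitions you'd still have to stitch together the edge groups of neighboring partitions, or fall back to one of the other approaches.

import scala.collection.mutable.ListBuffer

val lines = sc.textFile("hdfs:///path/to/input") // placeholder path

val records = lines.mapPartitions { iter =>
  val out = ListBuffer[List[String]]()
  val buf = ListBuffer[String]()
  for (line <- iter) {
    // A header line closes the previous record and starts a new one.
    if (line.startsWith("ID:") && buf.nonEmpty) {
      out += buf.toList
      buf.clear()
    }
    buf += line
  }
  if (buf.nonEmpty) out += buf.toList // flush the last record
  out.iterator
}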
