I'm new to Spark as well, but I was able to write a custom sentence input
format and sentence record reader using the Hadoop APIs; it reads multiple
lines of text, with the record boundary being the regex "[.?!]\s*". I then
plugged the SentenceTextInputFormat into the Spark API as shown below:

import org.apache.hadoop.io.{LongWritable, Text}

val inputRead = sc
  .hadoopFile("<path to the file in hdfs>",
    classOf[SentenceTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(value => value._2.toString)
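
The SentenceTextInputFormat itself isn't shown here, but for reference,
here is a rough sketch of how such a format could look against the old
mapred API. This is only an illustration, not the original code; among
other simplifications, it doesn't handle sentences that cross split
boundaries:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred._

// Splits text into sentence records: lines are accumulated until one ends
// with '.', '?' or '!' (the "[.?!]\s*" boundary), then emitted as one record.
class SentenceTextInputFormat extends FileInputFormat[LongWritable, Text] {
  override def getRecordReader(split: InputSplit, job: JobConf,
      reporter: Reporter): RecordReader[LongWritable, Text] =
    new SentenceRecordReader(
      new LineRecordReader(job, split.asInstanceOf[FileSplit]))
}

// Sketch only: records that span split boundaries are not handled.
class SentenceRecordReader(lines: LineRecordReader)
    extends RecordReader[LongWritable, Text] {

  private val line = new Text()

  // Accumulate whole lines until one ends at a sentence boundary.
  override def next(key: LongWritable, value: Text): Boolean = {
    val sentence = new StringBuilder
    while (lines.next(key, line)) {
      sentence.append(line.toString).append(" ")
      if (line.toString.matches(".*[.?!]\\s*")) {
        value.set(sentence.toString.trim)
        return true
      }
    }
    // Emit any trailing text that never hit a boundary.
    if (sentence.nonEmpty) { value.set(sentence.toString.trim); true }
    else false
  }

  override def createKey(): LongWritable = lines.createKey()
  override def createValue(): Text = new Text()
  override def getPos(): Long = lines.getPos()
  override def getProgress(): Float = lines.getProgress()
  override def close(): Unit = lines.close()
}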

In your case, I guess you can use the NLineInputFormat that Hadoop provides
and pass it as the input format parameter instead; a sketch follows below.
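
A minimal sketch of plugging NLineInputFormat in, assuming the old mapred
API and an arbitrary three lines per split (the config key name varies
between Hadoop versions):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.NLineInputFormat

// "mapred.line.input.format.linespermap" on older Hadoop releases;
// "mapreduce.input.lineinputformat.linespermap" on newer ones.
sc.hadoopConfiguration.set("mapred.line.input.format.linespermap", "3")

val records = sc
  .hadoopFile("<path to the file in hdfs>",
    classOf[NLineInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)

Note that NLineInputFormat only controls how many lines land in each split;
each record it returns is still a single line, so you may still need to
combine the lines of a split into one record yourself (for example with
mapPartitions).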

Maybe there are better ways to do it.

Regards,
Suman Bharadwaj S


On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren <[email protected]> wrote:

> I have a file that consists of multi-line records.  Is it possible to read
> in multi-line records with a method such as SparkContext.newAPIHadoopFile?
>  Or do I need to pre-process the data so that all the data for one element
> is in a single line?
>
> Thanks,
> Philip
>
>
