Just one correction: I think NLineInputFormat won't fit your use case. You may have to write a custom record reader based on TextInputFormat and plug it into Spark as shown above.
Regards,
Suman Bharadwaj S

On Wed, Dec 25, 2013 at 2:51 AM, suman bharadwaj <[email protected]> wrote:

> Even I'm new to Spark. But I was able to write a custom sentence input
> format and sentence record reader which reads multiple lines of text, with
> the record boundary being "[.?!]\s*", using the Hadoop APIs. I plugged the
> SentenceInputFormat into the Spark API as shown below:
>
>     val inputRead = sc.hadoopFile("<path to the file in hdfs>",
>         classOf[SentenceTextInputFormat],
>         classOf[LongWritable],
>         classOf[Text]).map(value => value._2.toString)
>
> In your case, I guess you can use the NLineInputFormat, which is provided
> by Hadoop, and pass it as a parameter.
>
> Maybe there are better ways to do it.
>
> Regards,
> Suman Bharadwaj S
>
>
> On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren <[email protected]> wrote:
>
>> I have a file that consists of multi-line records. Is it possible to
>> read in multi-line records with a method such as
>> SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so
>> that all the data for one element is in a single line?
>>
>> Thanks,
>> Philip
>>
>>
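The record-boundary logic that the custom sentence record reader described above would apply can be sketched without the Hadoop plumbing. The snippet below is a minimal illustration (the object and method names are hypothetical, not part of the thread's code): it splits a multi-line buffer into records on the `[.?!]\s*` pattern mentioned in the thread. The actual RecordReader work of tracking byte offsets and handling records that straddle input-split boundaries is omitted here.

```scala
// Hypothetical sketch: the splitting step a custom sentence record
// reader would perform on a buffer of text. The delimiter pattern
// consumes the sentence terminator and any trailing whitespace,
// including newlines, so records may span multiple input lines.
object SentenceSplitSketch {
  def splitSentences(text: String): Array[String] =
    text.split("[.?!]\\s*")

  def main(args: Array[String]): Unit = {
    val multiLine = "First record\nspans two lines. Second one? Third!"
    splitSentences(multiLine).foreach(r => println(s"[$r]"))
  }
}
```

Note that because the terminator is part of the delimiter, it is not preserved in the emitted records; a real reader would also need to decide what to do with a final record that has no terminator before end-of-file.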
