Just one correction: I think NLineInputFormat won't fit your use case. You may have to write a custom record reader based on TextInputFormat and plug it into Spark as shown above.
Regards,
Suman Bharadwaj S

On Wed, Dec 25, 2013 at 2:51 AM, suman bharadwaj <[email protected]> wrote:

> Even I'm new to Spark. But I was able to write a custom sentence input
> format and sentence record reader which reads multiple lines of text, with
> the record boundary being "[.?!]\s*", using the Hadoop APIs. I plugged the
> SentenceInputFormat into the Spark API as shown below:
>
>     val inputRead = sc.hadoopFile("<path to the file in hdfs>",
>         classOf[SentenceTextInputFormat],
>         classOf[LongWritable],
>         classOf[Text]).map(value => value._2.toString)
>
> In your case, I guess you can use the NLineInputFormat, which is provided
> by Hadoop, and pass it as a parameter.
>
> Maybe there are better ways to do it.
>
> Regards,
> Suman Bharadwaj S
>
>
> On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren <[email protected]> wrote:
>
>> I have a file that consists of multi-line records. Is it possible to
>> read in multi-line records with a method such as
>> SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so
>> that all the data for one element is in a single line?
>>
>> Thanks,
>> Philip
>>
>>
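The record-boundary logic that the custom sentence record reader described above would apply can be sketched without the Hadoop plumbing. The snippet below is a minimal illustration (the object and method names are hypothetical, not part of the thread's code): it splits a multi-line buffer into records on the `[.?!]\s*` pattern mentioned in the thread. The actual RecordReader work of tracking byte offsets and handling records that straddle input-split boundaries is omitted here.

```scala
// Hypothetical sketch: the splitting step a custom sentence record
// reader would perform on a buffer of text. The delimiter pattern
// consumes the sentence terminator and any trailing whitespace,
// including newlines, so records may span multiple input lines.
object SentenceSplitSketch {
  def splitSentences(text: String): Array[String] =
    text.split("[.?!]\\s*")

  def main(args: Array[String]): Unit = {
    val multiLine = "First record\nspans two lines. Second one? Third!"
    splitSentences(multiLine).foreach(r => println(s"[$r]"))
  }
}
```

Note that because the terminator is part of the delimiter, it is not preserved in the emitted records; a real reader would also need to decide what to do with a final record that has no terminator before end-of-file.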
