Thank you for pointing me in the right direction!

On 12/24/2013 2:39 PM, suman bharadwaj wrote:
Just one correction: I think NLineInputFormat won't fit your use case, since it only controls how many lines go into each input split; each record is still a single line. I think you may have to write a custom record reader, use it with TextInputFormat, and plug it into Spark as shown above.

Regards,
Suman Bharadwaj S


On Wed, Dec 25, 2013 at 2:51 AM, suman bharadwaj <[email protected]> wrote:

    Even I'm new to Spark, but I was able to write a custom sentence
    input format and sentence record reader that reads multiple lines
    of text, with the record boundary being "[.?!]\s*", using the
    Hadoop APIs. I then plugged the SentenceInputFormat into the Spark
    API as shown below:

    val inputRead = sc.hadoopFile("<path to the file in hdfs>",
      classOf[SentenceTextInputFormat], classOf[LongWritable],
      classOf[Text]).map(value => value._2.toString)
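The record boundary used above can be tried out on its own in plain Scala. A minimal sketch with made-up sample text (the SentenceTextInputFormat itself is not reproduced here; this only demonstrates how the "[.?!]\s*" pattern carves text into sentence records):

```scala
// Split text into sentence records at the same boundary the custom
// record reader uses: sentence-ending punctuation plus trailing whitespace.
val text = "Hello world. How are you? Fine! Thanks."
val sentences = text.split("[.?!]\\s*").toList
// sentences: List("Hello world", "How are you", "Fine", "Thanks")
```

Note that String.split drops the trailing empty token after the final ".", so the last sentence comes out clean.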

    In your case, I guess you can use the NLineInputFormat provided by
    Hadoop, and pass it as a parameter.

    Maybe there are better ways to do it.

    Regards,
    Suman Bharadwaj S


    On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren
    <[email protected]> wrote:

        I have a file that consists of multi-line records.  Is it
        possible to read in multi-line records with a method such as
        SparkContext.newAPIHadoopFile?  Or do I need to pre-process
        the data so that all the data for one element is in a single line?
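The pre-processing route mentioned here can be as simple as folding consecutive lines into one record before handing the file to Spark. A minimal sketch, assuming a hypothetical format where a blank line separates records (the sample data is made up):

```scala
// Collapse each blank-line-delimited group of lines into a single record,
// so a plain line-oriented input format can then read one record per line.
val lines = List("name: a", "value: 1", "", "name: b", "value: 2")
val records = lines.foldLeft(List(List.empty[String])) { (acc, line) =>
  if (line.isEmpty) List.empty[String] :: acc   // blank line: start a new record
  else (line :: acc.head) :: acc.tail           // otherwise: extend current record
}.map(_.reverse.mkString(" ")).reverse.filter(_.nonEmpty)
// records: List("name: a value: 1", "name: b value: 2")
```

After a pass like this, each element's data sits on a single line and the default TextInputFormat suffices.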

        Thanks,
        Philip



