Confusing behavior of newAPIHadoopFile

chang cheng Mon, 28 Jul 2014 01:03:07 -0700

Hi, all:

I have a Hadoop file containing fields separated by "!!", like below:
!!
field1
key1 value1
key2 value2
!!
field2
key3 value3
key4 value4
!!


I want to read the file into a pair RDD using TextInputFormat, with "!!" as
the record delimiter.
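
To make the intended record boundaries concrete, here is a minimal plain-Scala sketch that splits an inline copy of the sample above on the same "!!\n" delimiter. It only approximates what a delimiter-aware reader would produce and does not use Hadoop; the names `DelimiterSketch`, `sample`, and `records` are hypothetical.

```scala
object DelimiterSketch {
  // Inline copy of the sample file; the real file is read through Hadoop instead.
  val sample: String =
    "!!\nfield1\nkey1 value1\nkey2 value2\n" +
    "!!\nfield2\nkey3 value3\nkey4 value4\n" +
    "!!\n"

  // Split on the delimiter. The text starts with "!!\n", so the first
  // record is empty; trailing empty strings are dropped by split.
  val records: Array[String] = sample.split("!!\n")

  def main(args: Array[String]): Unit =
    records.foreach(r => println(s"[${r.trim}]"))
}
```

In this approximation each non-empty record holds one field name followed by its key/value lines.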

First, I tried the following code:

    val hadoopConf = new Configuration()
    hadoopConf.set("textinputformat.record.delimiter", "!!\n")

    val path = args(0)
    val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)

    rdd.take(3).foreach(println)

Contrary to my expectation, the result is:

    (120,)
    (120,)
    (120,)

From my experiments, "120" is the byte offset of the last field delimited
by "!!".

After digging into the Spark source code, I found that "textFile" is
implemented as:

    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
      classOf[Text], minPartitions).map(pair => pair._2.toString).setName(path)

So I modified my initial code by appending a map step to the
newAPIHadoopFile call:

    val hadoopConf = new Configuration()
    hadoopConf.set("textinputformat.record.delimiter", "!!\n")

    val path = args(0)
    val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)
      .map(pair => pair._2.toString)

    rdd.take(3).foreach(println)

Then, the results are:

    field1
    key1 value1
    key2 value2

    field2
    ....
As expected.

I'm confused by the first code snippet's behavior. 
Hope you can offer an explanation. Thanks!
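
One mechanism that could produce the repeated (120,) output, stated here as an assumption rather than a confirmed diagnosis: Hadoop record readers typically reuse a single key object and a single value object for all records, so references collected without copying all end up showing the last record read. The sketch below imitates that in plain Scala without Spark or Hadoop; `MutableRecord`, `ReuseSketch`, and the offsets 0 and 60 are hypothetical, and only 120 comes from the output above.

```scala
// Hypothetical stand-in for a pair of Hadoop Writables: one mutable
// object that is refilled for every record.
final class MutableRecord(var offset: Long, var text: String) {
  override def toString: String = s"($offset,$text)"
}

object ReuseSketch {
  // Mimics a record reader that returns the same instance from every next().
  def reader(data: Seq[(Long, String)]): Iterator[MutableRecord] = {
    val shared = new MutableRecord(0L, "")
    data.iterator.map { case (o, t) =>
      shared.offset = o
      shared.text = t
      shared
    }
  }

  // Offsets 0 and 60 are made up for illustration.
  val data: Seq[(Long, String)] = Seq((0L, "field1"), (60L, "field2"), (120L, ""))

  // Keeping references: all three elements are the same object, so each
  // one prints the state left by the last record.
  val aliased: List[String] = reader(data).toList.map(_.toString)

  // Copying into immutable values during iteration keeps records distinct.
  val copied: List[(Long, String)] =
    reader(data).map(r => (r.offset, r.text)).toList
}
```

Under that assumption, `.map(pair => pair._2.toString)` works because it copies each value into an immutable String before the next record overwrites the shared object.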



-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Confusing-behavior-of-newAPIHadoopFile-tp10764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
