Hi, all:
I have a Hadoop file containing fields separated by "!!", like below:
!!
field1
key1 value1
key2 value2
!!
field2
key3 value3
key4 value4
!!
I want to read the file into a pair RDD using TextInputFormat, specifying
"!!" as the record delimiter.
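To make the expected records concrete, here is a minimal sketch in plain Scala (no Hadoop or Spark involved) that splits the sample above on "!!\n" and pairs each record with the offset where it starts. The object name DelimiterSketch and the helper records are hypothetical, and this only approximates what I expect a delimiter-based reader to produce; it is not Hadoop's actual implementation.

```scala
object DelimiterSketch {
  // The sample file content shown above, written as a single string.
  val sample: String =
    "!!\nfield1\nkey1 value1\nkey2 value2\n!!\nfield2\nkey3 value3\nkey4 value4\n!!\n"

  // Splits `text` on `delimiter` and returns (start offset, record) pairs.
  // Offsets are character offsets, which equal byte offsets for ASCII input.
  def records(text: String, delimiter: String): List[(Long, String)] = {
    @annotation.tailrec
    def loop(start: Int, acc: List[(Long, String)]): List[(Long, String)] = {
      if (start >= text.length) acc.reverse
      else {
        val end = text.indexOf(delimiter, start)
        if (end < 0) ((start.toLong, text.substring(start)) :: acc).reverse
        else loop(end + delimiter.length, (start.toLong, text.substring(start, end)) :: acc)
      }
    }
    loop(0, Nil)
  }

  def main(args: Array[String]): Unit =
    records(sample, "!!\n").foreach(println)
}
```

Under these assumptions the sample yields three records: an empty one at offset 0 (the file starts with a delimiter), then the field1 block, then the field2 block.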
First, I tried the following code:
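import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
// new-API TextInputFormat, as required by newAPIHadoopFile
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// sc is an existing SparkContext
val hadoopConf = new Configuration()
hadoopConf.set("textinputformat.record.delimiter", "!!\n")
val path = args(0)
val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], hadoopConf)
rdd.take(3).foreach(println)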
Contrary to my expectation, the result is:
(120,)
(120,)
(120,)
From my experiments, "120" is the byte offset of the last field
separated by "!!".
After digging into the Spark source code, I found that "textFile" is implemented
as:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
So, I modified my initial code as follows (the trailing map call is the modification):
val hadoopConf = new Configuration()
hadoopConf.set("textinputformat.record.delimiter", "!!\n")
val path = args(0)
val rdd = sc.newAPIHadoopFile(path, classOf[TextInputFormat],
classOf[LongWritable], classOf[Text], hadoopConf).map(pair =>
pair._2.toString)
rdd.take(3).foreach(println)
Then, the results are:
field1
key1 value1
key2 value2
field2
....
As expected.
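One possible explanation, not verified here, is that Hadoop's RecordReader reuses the same Writable objects for every record, so holding on to them without copying leaves several references to one mutable object. The sketch below imitates that effect in plain Scala. MutableText and reusingReader are hypothetical stand-ins, not the real Hadoop classes, and the sketch does not reproduce the exact (120,) output.

```scala
object ReuseSketch {
  // Hypothetical stand-in for a mutable Writable such as Text (not the real class).
  final class MutableText(var content: String) {
    override def toString: String = content
  }

  // Hypothetical reader that hands out the SAME instance for every record,
  // overwriting its content each time the iterator advances.
  def reusingReader(records: Seq[String]): Iterator[MutableText] = {
    val shared = new MutableText("")
    records.iterator.map { r => shared.content = r; shared }
  }

  // Collect the objects first, read them afterwards: every slot shows the last record.
  def withoutCopy(records: Seq[String]): List[String] =
    reusingReader(records).toList.map(_.toString)

  // Copy each record to an immutable String before the reader advances.
  def withCopy(records: Seq[String]): List[String] =
    reusingReader(records).map(_.toString).toList

  def main(args: Array[String]): Unit = {
    val input = Seq("field1", "field2", "field3")
    println(withoutCopy(input)) // List(field3, field3, field3)
    println(withCopy(input))    // List(field1, field2, field3)
  }
}
```

If this is indeed the cause, the analogous fix in the first Spark snippet would be to copy both members of the pair before calling take, for example with map(pair => (pair._1.get, pair._2.toString)).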
I'm confused by the first code snippet's behavior.
Hope you can offer an explanation. Thanks!
-----
Senior in Tsinghua Univ.
github: http://www.github.com/uronce-cc