Oh, you literally mean these are different lines, not the structure of a single line.
You can't solve this in general by reading the entire file into one string: if the input is tens of gigabytes, you will exhaust memory on any of your machines (and at that point you might as well not bother with Spark).

Do you really mean you want the strings that aren't "!!"? That's just a filter operation. But as I understand it, you need an RDD of complex data structures, with many fields and key-value pairs spread across many lines. That's a difficult format to work with, since Hadoop assumes a line is a record, which is very common, but your records span lines.

If you have many small files, you could use wholeTextFiles to read each entire file as a single string value and simply parse it with an ordinary Scala function. That's fine as long as none of the files is huge.

For larger files, you can try mapPartitions, where you parse an Iterator[String] instead of one String at a time and combine results from across lines into an Iterator[YourRecordType]. The catch is that Hadoop may break a large file into several partitions, and a record that straddles a partition boundary will be lost or mangled. If you're willing to tolerate missing a few records here and there, it's a fine, scalable way to do it. Sketches of all three approaches follow after the quoted message below.

On Mon, Jul 28, 2014 at 12:43 PM, chang cheng <myai...@gmail.com> wrote:
> Nope.
>
> My input file's format is:
>
> !!
> string1
> string2
> !!
> string3
> string4
>
> sc.textFile("path") will return RDD("!!", "string1", "string2", "!!",
> "string3", "string4").
>
> What we need now is to transform this RDD to RDD("string1", "string2",
> "string3", "string4").
>
> Your solution may not handle this.
>
> -----
> Senior in Tsinghua Univ.
> github: http://www.github.com/uronce-cc
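If all you really want is the non-delimiter lines, as in the example above, the filter is a one-liner. A minimal sketch (the input path is hypothetical):

    val lines = sc.textFile("path/to/input")
    // Drop the "!!" delimiter lines, keep everything else
    val strings = lines.filter(_ != "!!")
    // strings is now RDD("string1", "string2", "string3", "string4")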
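For the many-small-files case, here is a rough sketch of the wholeTextFiles approach, assuming the "!!"-delimited format from the example, Unix newlines, and that "!!" never appears inside a field:

    // Each element is (fileName, entireFileContents)
    val files = sc.wholeTextFiles("path/to/dir")

    // Split each file's contents on the "!!" delimiters; each block of
    // lines between delimiters becomes one record (here just a Seq[String])
    val records = files.flatMap { case (_, content) =>
      content.split("!!")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(block => block.split("\n").toSeq)
    }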
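And a sketch of the mapPartitions version for larger files. YourRecordType is stubbed out as a Seq[String], and the partition-boundary caveat described above applies:

    val lines = sc.textFile("path/to/input")

    val records = lines.mapPartitions { iter =>
      val out = scala.collection.mutable.ArrayBuffer[Seq[String]]()
      val current = scala.collection.mutable.ArrayBuffer[String]()
      for (line <- iter) {
        if (line == "!!") {
          // Delimiter: close out the record accumulated so far
          if (current.nonEmpty) { out += current.toSeq; current.clear() }
        } else {
          current += line
        }
      }
      // A record split across two partitions shows up here as a fragment
      if (current.nonEmpty) out += current.toSeq
      out.iterator
    }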