Oh, you literally mean these are different lines, not the structure of a single line.
You can't solve this in general by reading the entire file into one string: if the input is tens of gigabytes, you will exhaust memory on any of your machines (and at that point you might as well not bother with Spark).

Do you really mean you want the strings that aren't "!!"? That's just a filter operation. But as I understand it, you need an RDD of complex data structures, with many fields and key-value pairs spread across many lines. That's a difficult format to work with, since Hadoop assumes a line is a record, which is very common, but your records span lines.

If you have many small files, you could use wholeTextFiles to read each entire file as a single string value and simply parse it with an ordinary Scala function. That's fine as long as none of the files is huge.

For larger files, you can try mapPartitions, where you parse an Iterator[String] instead of one String at a time and combine results from across lines into an Iterator[YourRecordType]. The catch is that Hadoop may break a large file into several partitions, and a record that straddles a partition boundary will be lost or mangled. If you're willing to tolerate missing a few records here and there, it's a fine, scalable way to do it. Sketches of all three approaches follow after the quoted message below.

On Mon, Jul 28, 2014 at 12:43 PM, chang cheng <myai...@gmail.com> wrote:
> Nope.
>
> My input file's format is:
>
> !!
> string1
> string2
> !!
> string3
> string4
>
> sc.textFile("path") will return RDD("!!", "string1", "string2", "!!",
> "string3", "string4").
>
> What we need now is to transform this RDD to RDD("string1", "string2",
> "string3", "string4").
>
> Your solution may not handle this.
>
> -----
> Senior in Tsinghua Univ.
> github: http://www.github.com/uronce-cc
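If all you really want is the non-delimiter lines, as in the example above, the filter is a one-liner. A minimal sketch (the input path is hypothetical):

    val lines = sc.textFile("path/to/input")
    // Drop the "!!" delimiter lines, keep everything else
    val strings = lines.filter(_ != "!!")
    // strings is now RDD("string1", "string2", "string3", "string4")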
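For the many-small-files case, here is a rough sketch of the wholeTextFiles approach, assuming the "!!"-delimited format from the example, Unix newlines, and that "!!" never appears inside a field:

    // Each element is (fileName, entireFileContents)
    val files = sc.wholeTextFiles("path/to/dir")

    // Split each file's contents on the "!!" delimiters; each block of
    // lines between delimiters becomes one record (here just a Seq[String])
    val records = files.flatMap { case (_, content) =>
      content.split("!!")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(block => block.split("\n").toSeq)
    }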
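And a sketch of the mapPartitions version for larger files. YourRecordType is stubbed out as a Seq[String], and the partition-boundary caveat described above applies:

    val lines = sc.textFile("path/to/input")

    val records = lines.mapPartitions { iter =>
      val out = scala.collection.mutable.ArrayBuffer[Seq[String]]()
      val current = scala.collection.mutable.ArrayBuffer[String]()
      for (line <- iter) {
        if (line == "!!") {
          // Delimiter: close out the record accumulated so far
          if (current.nonEmpty) { out += current.toSeq; current.clear() }
        } else {
          current += line
        }
      }
      // A record split across two partitions shows up here as a fragment
      if (current.nonEmpty) out += current.toSeq
      out.iterator
    }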