If your input files are already in HDFS, then parallelizing the parsing or
other transformation of their contents in Spark should be easy -- that's
just the way the system works.  So you should end up with something like:

val inputFilenames: List[String] = ... // whatever you need to do to generate
                                       // a list of your filenames as hdfs: URIs
inputFilenames.foreach { filename =>
  val lines = sc.textFile(filename)
  lines.map(line => parsingFunction(line)).
    saveAsHadoopFile(...)
    // or saveAsNewAPIHadoopFile; either way, I'm assuming that your
    // parsingFunction has produced an RDD[(K, V)] for some K and V
}
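For concreteness, the only requirement on parsingFunction is that it turns each line into a key/value pair, so that the result is an RDD[(K, V)] that saveAsHadoopFile can write. Here's a minimal, hypothetical sketch, assuming tab-separated "key<TAB>count" lines (the name and field layout are just placeholders for your own format):

```scala
// Hypothetical parsingFunction: turns one tab-separated input line into a
// (String, Int) pair, so mapping it over an RDD[String] yields an
// RDD[(String, Int)]. Swap in whatever field types your data actually has.
def parsingFunction(line: String): (String, Int) = {
  val fields = line.split("\t")
  (fields(0), fields(1).toInt)
}
```

Any function of that shape works; Spark will ship it to the workers and apply it to each line of each split.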

The parsingFunction will automatically be applied in parallel across the
splits of your HDFS files.  If that isn't generating enough parallelism for
you, then you can use the optional minSplits parameter of textFile() to
produce more splits: e.g., sc.textFile(filename, 16).

On Fri, Oct 11, 2013 at 12:17 PM, Ramkumar Chokkalingam <
[email protected]> wrote:

> Thanks for the recommendation, Mark.
>
> I have set up Hadoop and have been using HDFS to run my MR jobs, so I
> assume it won't take much time to start using those files from Spark. I
> can write scripts to move them to HDFS before running my Spark code.
> Since you suggested I don't need to call parallelize() on any object,
> should I go with the following approach:
>
> *Reading input from HDFS, one file at a time,*
> * output = parse the file *
> *Writing the output to an HDFS file using the Hadoop API*
> * Repeat the process for all input files*
>
> Is this the pipeline I should be following, given that my input files
> are ~4 MB each and I process (parse) one file at a time? Where/how does
> the parallelization (of my parsing) happen?
>
