Thanks for the recommendation, Mark.

I have set up Hadoop and was already using HDFS to run my MR jobs, so I
assume it won't take much time to start using those files from Spark. I
can write scripts to move them to HDFS before running my Spark code.
Since you suggested I don't need to call parallelize() on any object,
should I go with the following approach:

*Read the input from HDFS, one file at a time*
*Parse the file to produce the output*
*Write the output to an HDFS file using the Hadoop API*
*Repeat the process for all input files*

Is this the pipeline I should follow, given that my input files are
~4 MB each and I process (parse) one file at a time? Where/how does the
parallelization (of my parsing) happen?
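For what it's worth, here is a minimal PySpark sketch of the steps above as I understand them. The HDFS paths and the `parse_line` function are hypothetical placeholders for your own input location and parsing logic; the point is that `wholeTextFiles` turns each small file into one record of an RDD, and Spark parallelizes the parsing across partitions without any explicit `parallelize()` call:

```python
# Hypothetical per-record parser -- stand-in for your actual parsing logic.
def parse_line(line):
    # e.g. split a comma-separated record into its fields
    return line.strip().split(",")

def main():
    from pyspark import SparkContext

    sc = SparkContext(appName="ParseFiles")

    # wholeTextFiles reads each file as a single (path, contents) pair.
    # The resulting RDD is partitioned across the cluster, so each file
    # is parsed by whichever executor holds its partition -- that is
    # where the parallelism happens; no loop over files is needed.
    files = sc.wholeTextFiles("hdfs:///user/you/input/*")  # path is an example

    parsed = files.mapValues(
        lambda text: [parse_line(l) for l in text.splitlines()]
    )

    # Writes one part-file per partition back to HDFS.
    parsed.saveAsTextFile("hdfs:///user/you/output")  # path is an example
    sc.stop()

if __name__ == "__main__":
    main()
```

With ~4 MB files you may end up with many small partitions, so coalescing before the write is worth considering, but the shape above matches the pipeline you describe.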
