Thanks for the recommendation, Mark. I have set up Hadoop and was already using HDFS to run my MR jobs, so I assume it won't take much time to start using those files from Spark code. I can write scripts to move them to HDFS before running my Spark code. Since you suggested I don't need to call parallelize() on any object, should I go with the following approach?
* Read the input from HDFS, one file at a time
* output = parse the file
* Write the output to an HDFS file using the Hadoop API
* Repeat the process for all input files

Is this the pipeline I should be following, given that my input files are ~4MB each and I process (parse) one file at a time? Where/how does the parallelization (of my parsing) happen?
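For concreteness, here is a rough sketch of what I have in mind, assuming a Scala Spark job. The parse() function and the HDFS paths are placeholders for my actual logic and locations:

import org.apache.spark.{SparkConf, SparkContext}

object ParseFiles {
  // Placeholder for my actual parsing logic.
  def parse(path: String, contents: String): String = {
    contents.toUpperCase
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParseFiles"))

    // wholeTextFiles returns an RDD of (path, contents) pairs, one record
    // per input file, which suits small (~4MB) files like mine.
    val files = sc.wholeTextFiles("hdfs:///user/me/input")  // placeholder path

    val parsed = files.map { case (path, contents) => parse(path, contents) }

    // Each partition is written out as a separate part-* file.
    parsed.saveAsTextFile("hdfs:///user/me/output")  // placeholder path

    sc.stop()
  }
}

My understanding is that the map over the (path, contents) records is where the parallelism would come from, since Spark distributes the records across executors. Is that right, or do I still need to write the output myself via the Hadoop API?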
