Thanks both for your time. To make it clear before I start off -
From my input folder:
1. Read all the filenames into a local collection, say inputFileNames.
2. Call sc.parallelize(inputFileNames) on that collection [which would split my input filenames among the cluster nodes], giving InputFilesRDD.
3. outputRDD = InputFilesRDD.map(filename => { read the file [from local disk?] and parse })
4. Write the output (outputRDD) to the Hadoop DFS using the Hadoop API.
So, in this pipeline, my input is read from my local disk, and only when writing do I output to the Hadoop FileSystem as multiple files?
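To make sure we're talking about the same thing, here is a minimal sketch of the pipeline I have in mind, in Scala. This is only an illustration: it assumes an already-created SparkContext `sc`, and the input path, output URL, and `parse` function are placeholders of mine, not anything from your setup.

```scala
// Sketch only - assumes a running SparkContext `sc`.
// "/local/input", the HDFS URL, and parse() are placeholders.

// 1. Collect the input filenames on the driver.
val inputFileNames: Seq[String] =
  new java.io.File("/local/input").listFiles.map(_.getPath).toSeq

// 2. Distribute the filenames across the cluster.
val inputFilesRDD = sc.parallelize(inputFileNames)

// 3. Each worker reads and parses the files assigned to it.
// Caveat: this only works if every worker can see the file at that
// local path (shared mount, or files copied to every node).
def parse(contents: String): String = contents // hypothetical parse step

val outputRDD = inputFilesRDD.map { filename =>
  val contents = scala.io.Source.fromFile(filename).mkString
  parse(contents)
}

// 4. Write the parsed output to HDFS as multiple part files.
outputRDD.saveAsTextFile("hdfs://namenode:9000/output")
```

The caveat in step 3 is the crux of my question: with input on local disk, each worker must be able to reach the files it is handed, which is presumably why reading directly from HDFS is usually recommended.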
I found some Hadoop APIs under
JavaSparkContext<http://spark.incubator.apache.org/docs/0.6.1/api/core/spark/api/java/JavaSparkContext.html>
and a dedicated Hadoop RDD,
NewHadoopRDD<http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.NewHadoopRDD>.
Is this what you were referring to?