That won't work. First, parallelize is a SparkContext method called on collections present in your driver process, not an RDD method. An RDD is already a parallel collection, so there is no need to parallelize it. Second, where do your input files reside? It makes a big difference whether they are regular files local to your driver, on a network filesystem accessible to the worker nodes, or already in a Hadoop-compatible distributed filesystem such as HDFS.
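For reference, here is a rough sketch of that pipeline in Scala. Treat it as a sketch under assumptions: parseLine, the master URL, and the paths are placeholders, and it assumes every worker can read the input paths (e.g. a shared or network filesystem) -- plain driver-local files would fail on the workers.

```scala
import org.apache.spark.SparkContext

object ParseFiles {
  def parseLine(line: String): String = line  // placeholder parse logic

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "ParseFiles")

    // List the input filenames in the driver process...
    val filenames = new java.io.File("/shared/input").listFiles.map(_.getPath)

    // ...and let sc.parallelize (a SparkContext method, not an RDD method)
    // distribute that collection of filenames across the workers.
    val parsed = sc.parallelize(filenames)
      // Each worker opens and parses its share of the files; this only
      // works if the path is visible from every node.
      .flatMap(path => scala.io.Source.fromFile(path).getLines())
      .map(parseLine)

    // Write the results to HDFS; Spark produces multiple part files.
    parsed.saveAsTextFile("hdfs://namenode:9000/output")
  }
}
```

If the files are already in HDFS, it is simpler to skip the filename listing entirely and just call sc.textFile("hdfs://namenode:9000/input/*"), which gives you one RDD over all the files' lines.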
On Thu, Oct 10, 2013 at 10:10 PM, Ramkumar Chokkalingam <[email protected]> wrote:
> Thanks to you both for your time. To make it clear before I start off -
>
> From my input folder,
> Read all the filenames into a Spark RDD, say InputFilesRDD
> Call InputFilesRDD.parallelize() on that collection [which would split my
> input data filenames among various clusters]
> outputRDD = InputFilesRDD.foreach(filename => {Read the file [from local
> disk?] and parse})
> Write the output (outputRDD) to Hadoop DFS using the Hadoop API.
>
> So, in this pipeline my input will be on my local disk [read from], and
> only while writing do I write [output] to the Hadoop FileSystem as
> multiple files?
>
> I found some Hadoop APIs under JavaSparkContext
> <http://spark.incubator.apache.org/docs/0.6.1/api/core/spark/api/java/JavaSparkContext.html>
> and a dedicated Hadoop API, NewHadoopRDD
> <http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.NewHadoopRDD>.
> Is this what you were referring to?
