That won't work. First, parallelize is a SparkContext method called on
collections present in your driver process, not an RDD method. An RDD is
already a parallel collection, so there is no need to parallelize it.

Second, where do your input files reside? It makes a big difference
whether they are regular files local to your driver, in a network
filesystem accessible to the worker nodes, or already in a
Hadoop-compatible distributed filesystem like HDFS.
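For concreteness, a minimal sketch of the distinction, assuming a Scala Spark job with an existing SparkContext named `sc` (the paths and filenames here are hypothetical):

```scala
// Sketch only: assumes a running SparkContext `sc`; paths are illustrative.

// If the files are already in HDFS (or another Hadoop-compatible FS),
// read them directly into an RDD -- no parallelize() involved:
val lines = sc.textFile("hdfs://namenode:9000/input/*.txt")

// parallelize() belongs on the SparkContext and takes a driver-local
// collection, e.g. a plain Seq of filenames held in the driver:
val fileNames = Seq("a.txt", "b.txt", "c.txt")
val namesRDD = sc.parallelize(fileNames)

// Transform and write results back to HDFS:
val parsed = lines.map(line => line.trim)  // stand-in for real parsing
parsed.saveAsTextFile("hdfs://namenode:9000/output")
```

Note that a path passed to tasks this way is resolved on each worker, which is why driver-local files generally won't be visible unless every node shares the same filesystem path.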


On Thu, Oct 10, 2013 at 10:10 PM, Ramkumar Chokkalingam <
[email protected]> wrote:

> Thanks to both of you for your time. To make it clear before I start off -
>
> From my input folder:
> 1. Read all the filenames into a Spark RDD, say InputFilesRDD.
> 2. Call InputFilesRDD.parallelize() on that collection [which would split my
> input data filenames among the cluster nodes].
> 3. outputRDD = InputFilesRDD.foreach(filename => { read the file [from local
> disk?] and parse })
> 4. Write the output (outputRDD) to Hadoop DFS using the Hadoop API.
>
> So, in this pipeline, my input is read from my local disk, and only the
> output is written to the Hadoop FileSystem, as multiple files?
>
> I found some Hadoop APIs under JavaSparkContext
> (http://spark.incubator.apache.org/docs/0.6.1/api/core/spark/api/java/JavaSparkContext.html)
> and a dedicated Hadoop API, NewHadoopRDD
> (http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.NewHadoopRDD).
> Is this what you were referring to?
