The first line distributes your fileList variable across the cluster as an RDD, 
split into partitions according to the default parallelism setting (typically 
the total number of cores in your cluster).

Each of your workers will receive one or more slices of the data (depending on 
how many cores each executor has); each such slice is called a partition.
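
For example (just a sketch, assuming the same javaSparkContext and fileList 
variables as in your code below), you can pass the number of slices explicitly 
instead of relying on the default, and check how the RDD was split:

    JavaRDD<String> files = javaSparkContext.parallelize(fileList, 8); // ask for 8 partitions explicitly
    System.out.println("Number of partitions: " + files.partitions().size());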

What is your use case? If you want to load the files and continue processing in 
parallel, then a simple .map should work.
If you want to execute arbitrary code based on the list of files that each 
executor received, then use .foreach, which is executed for each entry on the 
workers. A rough sketch of both is below.
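
Roughly something like this (a sketch only, in Java 7 syntax with anonymous 
classes since lambdas are not available; loadAndFilter and process are 
placeholders for your own per-file logic, and the context / file list are 
assumed to come from your code):

    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.VoidFunction;

    public class FileJob {

        // Placeholders for whatever reading/filtering you do per file.
        static String loadAndFilter(String path) { return path; }
        static void process(String path) { }

        public static void run(JavaSparkContext javaSparkContext, List<String> fileList) {
            JavaRDD<String> files = javaSparkContext.parallelize(fileList);

            // Option 1: transform each file path on the executors and keep the
            // result as an RDD for further processing.
            JavaRDD<String> filtered = files.map(new Function<String, String>() {
                @Override
                public String call(String path) throws Exception {
                    return loadAndFilter(path);
                }
            });
            System.out.println("Filtered records: " + filtered.count()); // count() forces the lazy map to run

            // Option 2: run arbitrary side-effecting code for each file path,
            // executed on the workers rather than on the driver.
            files.foreach(new VoidFunction<String>() {
                @Override
                public void call(String path) throws Exception {
                    process(path);
                }
            });
        }
    }

The important part is that whatever you put inside call(...) runs on the 
executors, so the per-file work is distributed instead of being pulled back to 
the driver through toLocalIterator.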

-adrian

From: Vinoth Sankar
Date: Wednesday, October 28, 2015 at 2:49 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: How do I parallelize Spark Jobs at Executor Level.

Hi,

I'm reading and filtering a large number of files using Spark. It's getting 
parallelized only at the Spark Driver level. How do I make it parallelize at 
the Executor (Worker) level? Refer to the following sample. Is there any way to 
iterate the localIterator in parallel?

Note : I use Java 1.7 version

JavaRDD<String> files = javaSparkContext.parallelize(fileList);
Iterator<String> localIterator = files.toLocalIterator();

Regards
Vinoth Sankar
