Hello,
I have some confusion about the task distribution and execution in Spark
standalone mode.
(1) When we read a local file with SparkContext.textFile and run a
map/reduce job on it, how does Spark decide which worker node each piece
of data is sent to? Will the data be partitioned equally according to the
number of worker nodes, with each worker node getting one piece?
(2) If we read the data from HDFS instead, how does the above process work?
(3) SparkContext.textFile has a parameter 'minSplits'. Is it used to
divide the input file's data into 'minSplits' partitions? If so, how do we
know how many partitions each worker node receives?
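To make (3) concrete, here is my rough understanding of how the split size is computed from minSplits (I believe textFile delegates to Hadoop's FileInputFormat). This is a pure-Python sketch of my assumption about the formula, not actual Spark or Hadoop code:

```python
# Sketch of my understanding of FileInputFormat's split-size logic,
# which SparkContext.textFile appears to rely on. The formula and the
# 64 MB default block size are assumptions on my part.

def compute_split_size(total_size, min_splits,
                       block_size=64 * 1024 * 1024, min_size=1):
    """goalSize = totalSize / numSplits; the actual split size is then
    clamped between min_size and block_size."""
    goal_size = total_size // max(min_splits, 1)
    return max(min_size, min(goal_size, block_size))

# A 256 MB file with minSplits=2: the goal size (128 MB) is capped at
# the 64 MB block size, so we get 4 splits, not 2. If this sketch is
# right, minSplits is only a lower bound on the partition count.
size = compute_split_size(256 * 1024 * 1024, 2)
print(size // (1024 * 1024), "MB per split")  # 64 MB per split
```

If that is correct, it would explain why the actual number of partitions can exceed minSplits; I would still like to know how those partitions map onto worker nodes.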
(4) We can set the spark.default.parallelism system property. Does this
parameter apply per worker node? Say each worker node has 8 cores; if we
set spark.default.parallelism=32, will each core then need to handle 4
tasks? We can also set SPARK_WORKER_INSTANCES in spark-env.sh. For the
same worker node, if we set SPARK_WORKER_INSTANCES=8,
SPARK_WORKER_CORES=1, and spark.default.parallelism=32, will each core
still be sent 4 tasks? Will the performance of the whole system differ
between these two configurations?
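To spell out the arithmetic I am assuming in (4) — a trivial sketch just to confirm I am reasoning about the two configurations the right way (the even distribution of tasks across slots is my assumption, not something I know the scheduler guarantees):

```python
import math

# Scenario A: one worker instance with 8 cores.
# Scenario B: eight single-core worker instances on the same machine.
# Either way the machine offers 8 execution slots in total.
slots_a = 1 * 8   # SPARK_WORKER_INSTANCES=1, SPARK_WORKER_CORES=8
slots_b = 8 * 1   # SPARK_WORKER_INSTANCES=8, SPARK_WORKER_CORES=1
default_parallelism = 32

# With 32 tasks over 8 slots, each slot would process ~4 tasks in
# sequence, assuming tasks are spread evenly.
waves_a = math.ceil(default_parallelism / slots_a)
waves_b = math.ceil(default_parallelism / slots_b)
print(waves_a, waves_b)  # 4 4
```

So numerically the two setups look identical to me; my real question is whether the extra JVM/worker processes in scenario B change performance in practice.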
(5) How does a map/reduce job translate into tasks? For example, for
sc.parallelize([0,1,2,3]).map(lambda x: x), will there be four tasks?
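For reference, here is a pure-Python sketch of how I assume parallelize slices a local collection into partitions, with each partition then becoming one task. The slicing formula is my guess at the behavior, not copied from the Spark source:

```python
def slice_collection(data, num_slices):
    """Split `data` into num_slices contiguous chunks, the way I
    understand parallelize distributes a local collection. If my
    understanding is right, each chunk becomes one partition and each
    partition is processed by one task."""
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

# With 4 slices, [0, 1, 2, 3] gives one element per partition, so the
# map stage would run 4 tasks; with 2 slices, only 2 tasks.
print(slice_collection([0, 1, 2, 3], 4))  # [[0], [1], [2], [3]]
print(slice_collection([0, 1, 2, 3], 2))  # [[0, 1], [2, 3]]
```

In other words, I suspect the task count depends on the number of partitions rather than the number of elements, but I would appreciate confirmation.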

Any help will be appreciated.
Thanks!

-- 

Shangyu, Luo
Department of Computer Science
Rice University
