Hello,

I have some confusion about task distribution and execution in Spark standalone mode.

(1) When we read a local file with SparkContext.textFile and run a map/reduce job on it, how does Spark decide which worker node to send the data to? Is the data divided/partitioned equally according to the number of worker nodes, with each worker node getting one piece?

(2) If we read the data via HDFS instead, how does the above process work?

(3) SparkContext.textFile has a parameter 'minSplits'. Is it used to divide the input file's data into 'minSplits' partitions? If so, how do we know how many partitions each worker node receives?

(4) We can set spark.default.parallelism as a system property. Is this parameter applied per worker node? Say each worker node has 8 cores; if we set spark.default.parallelism=32, will each core have to deal with 4 tasks? We can also set SPARK_WORKER_INSTANCES in spark-env.sh. On the same worker node, if we set SPARK_WORKER_INSTANCES=8, SPARK_WORKER_CORES=1, and spark.default.parallelism=32, will each core still be sent 4 tasks? Will the performance of the whole system differ between these two configurations?

(5) Is each map/reduce operation counted as one task? For example, with

    sc.parallelize([0, 1, 2, 3]).map(lambda x: x)

will there be four tasks?
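To make question (3) concrete, here is my rough understanding of how 'minSplits' turns into a number of input splits, written as a small pure-Python sketch. This mirrors (in simplified form, ignoring details like the slack factor) the classic Hadoop FileInputFormat split computation that I believe textFile relies on under the hood; the function and variable names here are my own, not Spark's API:

```python
def compute_splits(total_size, block_size, min_splits, min_size=1):
    """Sketch of FileInputFormat-style split sizing (simplified).

    Returns the byte sizes of the input splits for one file.
    total_size -- file size in bytes
    block_size -- HDFS block size in bytes
    min_splits -- the minSplits hint passed to textFile
    """
    # Aim for at least min_splits pieces...
    goal_size = total_size // max(min_splits, 1)
    # ...but a split never exceeds one HDFS block.
    split_size = max(min_size, min(goal_size, block_size))

    splits = []
    remaining = total_size
    while remaining > 0:
        size = min(split_size, remaining)
        splits.append(size)
        remaining -= size
    return splits

# A 256 MB file on HDFS with 64 MB blocks and minSplits=2 still yields
# four 64 MB splits, because a split cannot span a block boundary:
mb = 1024 * 1024
print(compute_splits(256 * mb, 64 * mb, 2))
```

If this sketch is roughly right, then minSplits is only a lower-bound hint on the number of partitions, which is part of what I am hoping someone can confirm.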
Any help will be appreciated. Thanks!

--
Shangyu Luo
Department of Computer Science
Rice University
