1.) What about if the data is in S3 and we cache it in memory, instead of HDFS?
2.) How is the number of reducers determined in both cases?
Even if I specify set mapred.reduce.tasks=50, only 2 reducers are
allocated instead of 50, although the query/tasks still complete.
(A sketch in the plain RDD API follows after this message.)
Regards,
Arpit
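
For illustration, here is a minimal sketch in the plain Spark (Scala) RDD
API, with hypothetical bucket and path names. Note that set
mapred.reduce.tasks is a Shark/Hive-level setting; at the RDD level the
closest equivalent is the numPartitions argument of the shuffle operation
(or spark.default.parallelism when it is omitted):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("s3-cache-sketch"))

  // Reading from S3 instead of HDFS: the RDD still gets one partition per
  // input split, but there is no data locality to exploit, so tasks run
  // wherever executors have free slots. cache() keeps the partitions in
  // executor memory after the first action.
  val lines = sc.textFile("s3n://my-bucket/input/*.log").cache()

  // At the RDD level the reducer count is the numPartitions argument of
  // the shuffle operation (spark.default.parallelism when omitted):
  val counts = lines.flatMap(_.split("\\s+"))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _, 50)   // ask for 50 reduce tasks

  println(counts.partitions.length)          // 50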
For a HadoopRDD, the Spark scheduler first calculates the number of tasks
based on the input splits. Usually people use this with HDFS data, so in
that case it's based on HDFS blocks. If the HDFS datanodes are co-located
with the Spark cluster, then it will try to run each task on the datanode
that contains the corresponding block.
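
A minimal sketch (hypothetical HDFS path) of how to observe this from the
RDD API: the partition count is the task count for the stage, and
preferredLocations exposes the hosts the scheduler will try to place each
task on.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("splits-sketch"))

  // textFile builds a HadoopRDD; the scheduler creates one task per input
  // split, which for HDFS data is roughly one per block.
  val rdd = sc.textFile("hdfs://namenode:8020/data/input")
  println("tasks in this stage: " + rdd.partitions.length)

  // For each partition the scheduler asks the RDD for preferred locations
  // (the datanodes holding that block) and tries to schedule the task there.
  rdd.partitions.foreach { p =>
    println("partition " + p.index + " prefers: " +
            rdd.preferredLocations(p).mkString(", "))
  }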
During a Spark stage, how are tasks split among the workers? Specifically,
for a HadoopRDD, what determines which worker gets which task?