1.) What happens if the data is in S3 and we cache it in memory, instead of HDFS?
2.) How is the number of reducers determined in each case?

Even if I specify set mapred.reduce.tasks=50, only 2 reducers are allocated instead of 50, although the query/tasks still complete.

Regards,
Arpit
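For context, mapred.reduce.tasks is a Hive/Shark setting; in core Spark the reduce-side task count is usually set by the numPartitions argument to the shuffle operator, or by spark.default.parallelism. Below is a minimal sketch of that approach; the app name, S3 path, and partition count are illustrative placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

object ReducerCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reducer-count"))

    // Hypothetical S3 input; with an s3n:// path the splits come from the
    // input format, not from HDFS blocks.
    val pairs = sc.textFile("s3n://my-bucket/input")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))

    // The second argument to reduceByKey fixes the number of reduce tasks
    // (shuffle partitions); without it Spark falls back to
    // spark.default.parallelism or the parent RDD's partition count.
    val counts = pairs.reduceByKey(_ + _, 50)

    println("reduce partitions = " + counts.partitions.length) // expect 50
    sc.stop()
  }
}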
On Mon, Apr 21, 2014 at 9:33 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> For a HadoopRDD, the Spark scheduler first calculates the number of tasks
> based on input splits. Usually people use this with HDFS data, so in that
> case it's based on HDFS blocks. If the HDFS datanodes are co-located with
> the Spark cluster, it will try to run each task on the datanode that
> contains its input to achieve higher throughput. Otherwise, all of the
> nodes are considered equally fit to run any task, and Spark just load
> balances across them.
>
>
> On Sat, Apr 19, 2014 at 9:25 PM, David Thomas <dt5434...@gmail.com> wrote:
>
>> During a Spark stage, how are tasks split among the workers? Specifically,
>> for a HadoopRDD, who determines which worker gets which task?
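To make the input-split point concrete, here is a minimal sketch showing that a HadoopRDD's partition count, and hence its map-side task count, follows the input splits; the HDFS path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object SplitCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-count"))

    // One map task per input split; for HDFS this is roughly one per block.
    val byBlocks = sc.textFile("hdfs:///data/events.log")
    println("tasks from splits = " + byBlocks.partitions.length)

    // The second argument is only a lower-bound hint: it can request more
    // (smaller) splits, but never fewer than the input format produces.
    val finer = sc.textFile("hdfs:///data/events.log", 100)
    println("tasks with hint   = " + finer.partitions.length)

    sc.stop()
  }
}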