1.) What if the data is in S3 and we cache it in memory, instead of HDFS?
2.) How is the number of reducers determined in each case?

Even if I specify set mapred.reduce.tasks=50, only 2 reducers are allocated
instead of 50, although the query/tasks still complete.
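For context, this is the sort of thing I would expect to control the
parallelism on the Spark side (a minimal sketch in Spark's core Scala API,
not Shark's SET command; the bucket name and the number 50 are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder app; the S3 path and the count 50 are just examples.
    val conf = new SparkConf()
      .setAppName("reduce-parallelism-sketch")
      // Default partition count used by shuffles when none is given explicitly.
      .set("spark.default.parallelism", "50")
    val sc = new SparkContext(conf)

    // Reading from S3 works like HDFS: one map task per input split.
    val lines = sc.textFile("s3n://some-bucket/some-path")

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      // Explicit reducer count for this shuffle, overriding the default.
      .reduceByKey(_ + _, 50)

    println(counts.partitions.length) // expect 50 reduce-side partitions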

Regards,
Arpit





On Mon, Apr 21, 2014 at 9:33 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> For a HadoopRDD, first the spark scheduler calculates the number of tasks
> based on input splits. Usually people use this with HDFS data so in that
> case it's based on HDFS blocks. If the HDFS datanodes are co-located with
> the Spark cluster then it will try to run the tasks on the data node that
> contains its input to achieve higher throughput. Otherwise, all of the
> nodes are considered equally fit to run any task, and Spark just load
> balances across them.
>
>
> On Sat, Apr 19, 2014 at 9:25 PM, David Thomas <dt5434...@gmail.com> wrote:
>
>> During a Spark stage, how are tasks split among the workers? Specifically
>> for a HadoopRDD, who determines which worker has to get which task?
>>
>
>
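If I understand Patrick's explanation correctly, the partition (and hence
task) count of a HadoopRDD just follows the input splits, and the optional
second argument to textFile only suggests a minimum. A small sketch with a
hypothetical HDFS path:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("split-count-sketch"))

    // textFile builds a HadoopRDD; the number of partitions (and tasks)
    // comes from the InputFormat's splits, roughly one per HDFS block.
    val rdd = sc.textFile("hdfs://namenode:8020/some/path")
    println(rdd.partitions.length)

    // The second argument is a suggested minimum: Hadoop may split further,
    // but it will not merge blocks to go below the natural split count.
    val finer = sc.textFile("hdfs://namenode:8020/some/path", 200)
    println(finer.partitions.length)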
