I definitely think so.  Network transfer is often a bottleneck for
distributed jobs, especially if you're doing a lot of groupBys or
re-keying your data.
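
For example, here's a rough sketch of why those operations hurt, using a
toy local SparkContext just for illustration (point it at your real
master in practice): reduceByKey combines values map-side before the
shuffle, while groupByKey ships every pair across the network first.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for pair RDDs

object ShuffleSketch {
  def main(args: Array[String]) {
    // "local[2]" is just for trying this out on one machine.
    val sc = new SparkContext("local[2]", "shuffle-sketch")

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // groupByKey shuffles every (key, value) pair to the reducer
    // before anything is aggregated, so all of it crosses the network.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey runs the combine function map-side first, so far
    // less data goes over the wire for the same result.
    val reduced = pairs.reduceByKey(_ + _)

    println(grouped.collect().toSeq)
    println(reduced.collect().toSeq)

    sc.stop()
  }
}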

What network speed do you have between your HDFS nodes?  1 Gbps?


On Fri, Jan 3, 2014 at 2:34 PM, Debasish Das <[email protected]> wrote:

> Hi,
>
> I have HDFS and MapReduce running on 20 nodes and an experimental Spark
> cluster running on a subset of the HDFS nodes (say 8 of them).
>
> If some ETL is done using MR, the data will most likely be replicated
> across all 20 nodes (assuming I used all of them).
>
> Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is
> running, so that all the RDDs are data-local and bulk data transfer is
> minimized?
>
> Thanks.
> Deb
>
