I definitely think so. Network transfer is often a bottleneck for distributed jobs, especially if you're using groupBys or re-keying things often.
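For instance, a per-key aggregation written with groupByKey ships every record across the network to the reducing node, while reduceByKey pre-combines values on each map node and shuffles only one partial result per key per node. A minimal sketch of the difference (the master URL, input path, and record format are made up for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("spark://master:7077", "ShuffleDemo")

    // Hypothetical input: HDFS text files with "key value" lines.
    val pairs = sc.textFile("hdfs:///data/pairs.txt").map { line =>
      val Array(k, v) = line.split(" ")
      (k, v.toLong)
    }

    // groupByKey shuffles every (key, value) record to the reducer
    // node before summing -- heavy network transfer.
    val slow = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums values map-side first, so far less data
    // crosses the network.
    val fast = pairs.reduceByKey(_ + _)

    fast.saveAsTextFile("hdfs:///data/sums")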
What network speed do you have between each HDFS node? 1 gigabit?

On Fri, Jan 3, 2014 at 2:34 PM, Debasish Das <[email protected]> wrote:

> Hi,
>
> I have HDFS and MapReduce running on 20 nodes and an experimental Spark
> cluster running on a subset of the HDFS nodes (say 8 of them).
>
> If some ETL is done using MR, the data will most likely be replicated
> across all 20 nodes (assuming I used all the nodes).
>
> Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is
> running, so that all the RDDs are data-local and bulk data transfer is
> minimized?
>
> Thanks.
> Deb
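Following up on the quoted question: colocating the Spark workers with the HDFS DataNodes is the usual setup. Spark asks HDFS for each block's replica locations and prefers to launch each task on a node that holds a replica, so a map-only scan moves no bulk data over the network; only shuffles (groupBy, join, re-keying) do. A rough sketch, again with a hypothetical master URL and path:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "LocalityDemo")

    // Each HDFS block becomes one partition; when the workers run on
    // the DataNodes, tasks are scheduled on a node holding a replica
    // (they show up as NODE_LOCAL in the web UI).
    val logs = sc.textFile("hdfs://namenode:9000/logs/2014-01-03")

    // A map-only job like this reads each block locally and shuffles
    // nothing over the network.
    val errors = logs.filter(_.contains("ERROR")).count()
    println("error lines: " + errors)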
