Hi,

I have HDFS and MapReduce running on 20 nodes, and an experimental Spark cluster running on a subset of those HDFS nodes (say 8 of them).

If some ETL is done with MapReduce, the output data will most likely be spread (and replicated) across all 20 nodes, assuming I used all of them. Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is running, so that the RDDs are data-local and bulk data transfer is minimized?

Thanks,
Deb
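For context, here is a rough sketch (Scala) of how I was planning to check locality once the data is on HDFS: Spark gets each block's replica hosts from HDFS and prefers scheduling tasks there, so the question is whether every one of those hosts also runs a Spark worker. The master URL and the HDFS path below are just placeholders for my setup, not real values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    // Connect to the existing standalone cluster; master URL is a placeholder.
    val conf = new SparkConf()
      .setAppName("locality-check")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)

    // Placeholder path for the MR job's ETL output on HDFS.
    val rdd = sc.textFile("hdfs:///user/deb/etl-output")

    // Each partition's preferred locations come from the HDFS block replica hosts;
    // a task can only run NODE_LOCAL if a Spark worker is on one of those hosts.
    rdd.partitions.take(5).foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```

My expectation is that with workers on only 8 of the 20 nodes, many of the preferred hosts printed above would have no local worker, and that is where the bulk data transfer would come from.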
