Hi,

I have HDFS and MapReduce running on 20 nodes and an experimental Spark
cluster running on a subset of the HDFS nodes (say 8 of them).

If some ETL is done using MR, the data will most likely be replicated
across all 20 nodes (assuming I use all of them).

Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is
running, so that all the RDDs are data-local and bulk data transfer is
minimized?
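
For context, the kind of job I have in mind looks roughly like this (a
minimal Scala sketch just to illustrate the question; the path and app
name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hdfs-locality-check")
    val sc = new SparkContext(conf)

    // If the executors run on the same nodes as the HDFS DataNodes,
    // partitions of this file can be read locally (NODE_LOCAL tasks);
    // otherwise each block has to be pulled over the network (ANY locality).
    val lines = sc.textFile("hdfs:///user/deb/etl-output/part-*")

    println(s"partitions: ${lines.partitions.length}, count: ${lines.count()}")
    sc.stop()
  }
}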

Thanks.
Deb
