Hello, I am new to the Spark world and recently started exploring it in standalone mode. It would be great if I could get clarification on the doubts below:
1. Driver locality - The documentation mentions that "client" deploy mode is not a good fit if the machine running "spark-submit" is not co-located with the worker machines, and that cluster mode is not available with standalone clusters. Do we therefore have to submit all applications from the master machine (assuming we don't have a separate co-located gateway machine)?
2. How does the driver locality above work with a spark-shell running on a local machine?
3. I am a little confused about the role of the driver program. Does the driver do any computation during a Spark application's life cycle? For instance, in a simple row-count app, each worker node calculates its local row count. Does the driver then sum up the local row counts? In short, where does the reduce phase run in this case?
4. When accessing HDFS data over the network, do the worker nodes read the data in parallel? How is HDFS data over the network accessed in a Spark application?
Sorry if these questions have already been discussed.
Thanks,
Swapnil
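P.S. To make question 1 concrete, this is the kind of client-mode submission I mean (the master host, class name, jar path, and HDFS path are all placeholders, not a real setup):

```shell
# Client deploy mode: the driver process runs on the machine where
# spark-submit is invoked, so (per the docs) that machine should be
# co-located with the workers.
# spark://master-host:7077, com.example.RowCount, and the paths below
# are hypothetical placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.RowCount \
  /path/to/rowcount.jar hdfs://namenode:8020/data/input
```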
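P.P.S. For question 3, here is a toy plain-Python model of how I currently picture count() splitting work between workers and the driver. This is only my mental model, not actual Spark code or internals; all names are illustrative:

```python
# Toy model of a distributed row count:
# each "worker" counts the rows of its own partition locally,
# and the "driver" sums those partial counts (the reduce step).
# Plain Python only; function names are illustrative, not Spark APIs.

def worker_local_count(partition):
    """Runs on a worker: count the rows in one partition."""
    return sum(1 for _row in partition)

def driver_count(partitions):
    """Runs on the driver: sum the per-partition counts."""
    return sum(worker_local_count(p) for p in partitions)

# Example: 10 rows split unevenly across 3 partitions.
partitions = [
    ["r1", "r2", "r3"],
    ["r4", "r5"],
    ["r6", "r7", "r8", "r9", "r10"],
]
print(driver_count(partitions))  # → 10
```

So my question is essentially whether the final sum in driver_count above really happens on the driver, or whether Spark runs that reduce on the workers too.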