Hello, I am new to Apache Spark and am looking for some close guidance or collaboration on my Spark project, which has the following main components:
1. Writing scripts for automated setup of a multi-node Apache Spark cluster with the Hadoop Distributed File System (HDFS). This is required because I don't have a fixed set of machines to run my Spark experiments and therefore need an easy, quick, and automated way to do the entire Spark setup.
2. Writing scripts for simple SQL queries that read input from HDFS, run on the multi-node Spark cluster, and store the output in HDFS.
3. Generating detailed profiling results, such as latency and shuffled data size, for every task/operator in a SQL query, and generating graphs for the same.

Happy to discuss in more detail.

Thanks,
Dhruv
dh...@umn.edu

--------------------------------------------------
Dhruv Kumar
PhD Candidate
Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me