Hi all,

We are migrating from Hive to Spark, and we used the Spark-SQL CLI to run our Hive queries for performance testing. I am new to Spark and have a few clarifications. Our setup:
1. Set up 10 boxes, one master and 9 slaves, in standalone mode. Each box is a launcher to our external Hadoop grid.
2. Copied hive-site.xml to the Spark conf directory. The Hive metastore URI is external to our Spark cluster.
3. Use the spark-sql CLI to submit Hive queries directly from the master host. Our queries hit Hive tables on the remote HDFS cluster, stored in ORC format.

Questions:

1. What is the sequence of steps from the time an HQL query is submitted to its execution on the Spark cluster?
2. Is an RDD created to read the ORC files from remote HDFS? Does it get the storage information from the Hive metastore?
3. Since the HDFS cluster is remote from the Spark cluster, how is data locality achieved here?
4. Does running queries in the Spark-SQL CLI against a remote Hive metastore incur any cost in query performance?
5. The Spark SQL programming guide mentions that the Spark-SQL CLI is only for local mode. What does this mean? We were able to submit hundreds of queries using the CLI. Is there any downside to this approach?
6. Is it possible to create one HiveContext, add all UDF jars once, and submit 100 queries with the same HiveContext? (A sketch of what I mean is below.)

Thanks,
Narayanan
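P.S. To make question 6 concrete, here is a minimal sketch of the pattern I have in mind, assuming a Spark 1.x standalone deployment with hive-site.xml on the classpath. The jar path, UDF class, and table names are just placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object BatchHiveQueries {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("BatchHiveQueries")
        val sc = new SparkContext(conf)

        // One HiveContext, reused for every query. It reads hive-site.xml
        // from the classpath, so the remote metastore is configured once.
        val hiveContext = new HiveContext(sc)

        // Add UDF jars and register functions a single time up front;
        // they remain available to all subsequent queries on this context.
        // Path and class name below are hypothetical.
        hiveContext.sql("ADD JAR /path/to/our-udfs.jar")
        hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")

        // Illustrative stand-ins for our ~100 HQL statements; in practice
        // these would be read from query files.
        val queries = Seq(
          "SELECT my_udf(col1) FROM orc_table_1 LIMIT 10",
          "SELECT COUNT(*) FROM orc_table_2"
        )

        // Run each query against the same HiveContext.
        queries.foreach { q =>
          hiveContext.sql(q).collect().foreach(println)
        }

        sc.stop()
      }
    }

Is something like this the recommended alternative to the spark-sql CLI for a large batch of queries?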