Hi all,

We are migrating from Hive to Spark, and we used the Spark-SQL CLI to
run our Hive queries for performance testing. I am new to Spark and
have a few questions. Our setup is as follows:


1. Set up 10 boxes, one master and 9 slaves, in standalone mode. Each
of the boxes acts as a launcher to our external Hadoop grid.
2. Copied hive-site.xml into the Spark conf directory. The Hive
metastore URI points to a metastore external to our Spark cluster.
3. Use the spark-sql CLI to submit Hive queries directly from the
master host. Our queries hit Hive tables on the remote HDFS cluster,
stored in ORC format (a rough sketch of how we understand this works
programmatically is included after this list).
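
For reference, here is a minimal sketch of what we assume the CLI is
doing under the hood (Spark 1.3+ style; the database and table names
are placeholders, and we are assuming HiveContext picks up
hive-site.xml from the Spark conf directory):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object HiveQuerySketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("HiveQuerySketch"))
      // HiveContext reads hive-site.xml from the Spark conf directory, so it
      // talks to the same remote metastore that the spark-sql CLI uses.
      val hc = new HiveContext(sc)
      // The metastore supplies the table location and ORC input format; the
      // ORC data itself is read from the remote HDFS cluster.
      val df = hc.sql("SELECT count(*) FROM some_db.some_orc_table")
      df.show()
      sc.stop()
    }
  }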


Questions :

1. What is the sequence of steps from the time an HQL query is
submitted until it is executed on the Spark cluster?
2. Was an RDD created to read the ORC files from the remote HDFS? Did
it get the storage information from the Hive metastore?
3. Since the HDFS cluster is remote from the Spark cluster, how is
data locality achieved here?
4. Does running queries in the Spark-SQL CLI and accessing the remote
Hive metastore incur any query performance cost?
5. The Spark SQL programming guide mentions that the Spark-SQL CLI is
only for local mode. What does this mean? We were able to submit
hundreds of queries using the CLI. Is there any downside to this
approach?
6. Is it possible to create one HiveContext, add all UDF jars once,
and submit 100 queries with the same HiveContext (see the sketch
after this list)?
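
For question 6, the sketch below is roughly what we have in mind
(again just a sketch, assuming Spark 1.3+; the jar path, UDF class,
and queries are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object ReuseHiveContext {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("ReuseHiveContext"))
      val hc = new HiveContext(sc)

      // Register the UDF jar and function once, up front.
      hc.sql("ADD JAR /path/to/our-udfs.jar")
      hc.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")

      // Reuse the same HiveContext for all subsequent queries.
      val queries = Seq(
        "SELECT my_udf(col1) FROM some_db.table_a",
        "SELECT count(*) FROM some_db.table_b"
      )
      queries.foreach(q => hc.sql(q).show())

      sc.stop()
    }
  }

Is reusing one HiveContext like this the recommended way to amortize
the metastore and UDF setup cost across many queries, or is there a
better pattern?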

Thanks
Narayanan
