Hi guys,

I am using CDH 5.3.3, which comes with Hive 0.13.1 and Spark 1.2. So to answer your question: it's not Tez (which I believe comes with Hortonworks).

The original Hive query was run with Hive defaults. I have now added the following Hive params to improve the timings:

SET mapreduce.job.reduces=16;
SET mapreduce.tasktracker.map.tasks.maximum=24;
SET mapreduce.tasktracker.reduce.tasks.maximum=16;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;

Now: Time taken: 140.139 seconds, Fetched: 29597 row(s)
(surprisingly close to spark-sql now LOL. Time to tweak spark-sql now.)

EARLIER RESULTS
Hive – 326.021 seconds, Fetched: 29597 row(s)
Impala – Fetched 27625 row(s) in 17.02s
spark-sql – Time taken: 120.236 seconds

I don't have the bandwidth to manage individual components on the cluster :-) since I am doing all this solo and delivering ML solutions to production LOL. So I depend on a distribution such as CDH. The downside is that one is always a couple of versions behind.

Thanks for your questions.

regards
sanjay

From: Michael Armbrust <mich...@databricks.com>
To: user <user@spark.apache.org>
Sent: Thursday, June 18, 2015 3:25 PM
Subject: Re: Spark-sql versus Impala versus Hive

I would also love to see a more recent version of Spark SQL. There have been a lot of performance improvements between 1.2 and 1.4 :)

On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez <snu...@hortonworks.com> wrote:

Interesting. What were the Hive settings? Specifically, it would be useful to know if this was Hive on Tez.

- Steve

From: Sanjay Subramanian
Reply-To: Sanjay Subramanian
Date: Thursday, June 18, 2015 at 11:08
To: "user@spark.apache.org"
Subject: Spark-sql versus Impala versus Hive

I just published results of my findings here:
https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/
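
As a rough aid for reading the timings quoted in this thread, the ratios between them can be worked out with a few lines of Python. This is only a sketch: the dictionary keys and the `ratio` helper are illustrative names, not anything from the original posts, and the numbers are single runs copied from the messages above. Note also that Impala fetched 27625 rows while Hive fetched 29597, so the Impala figure may not be a strictly like-for-like comparison.

```python
# Wall-clock timings in seconds, copied from the messages in this thread.
# Key names are illustrative labels, not identifiers from any tool.
timings = {
    "hive_defaults": 326.021,  # Hive with default settings
    "hive_tuned": 140.139,     # Hive after the SET parameters
    "spark_sql": 120.236,      # spark-sql (Spark 1.2)
    "impala": 17.02,           # Impala (fetched fewer rows, see note above)
}


def ratio(slower: str, faster: str) -> float:
    """Return how many times longer `slower` took than `faster`."""
    return timings[slower] / timings[faster]


# Effect of the Hive tuning, and how tuned Hive compares to the others.
tuning_speedup = ratio("hive_defaults", "hive_tuned")
tuned_hive_vs_spark = ratio("hive_tuned", "spark_sql")
tuned_hive_vs_impala = ratio("hive_tuned", "impala")

if __name__ == "__main__":
    print(f"Hive defaults vs tuned Hive: {tuning_speedup:.2f}x")
    print(f"Tuned Hive vs spark-sql:     {tuned_hive_vs_spark:.2f}x")
    print(f"Tuned Hive vs Impala:        {tuned_hive_vs_impala:.2f}x")
```

Because each figure is a single measurement, these ratios are best read as rough indications rather than benchmarks.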