By the way, I also tried hql("select * from m").count. It is terribly slow too.
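
For reference, here is a minimal sketch of how timings like these could be taken in one shell session, so that both runs are measured the same way. The `timed` helper is hypothetical (it is not part of Spark), and the Spark calls are left commented out so the snippet also runs in a plain Scala REPL. They assume `sc` and the `hiveContext._` import from the quoted message below.

```scala
// Hypothetical helper (not a Spark API): evaluates the block once and
// returns its result together with the elapsed wall-clock time in seconds.
def timed[A](block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  (result, seconds)
}

// Stand-in workload so the snippet runs without a cluster.
val (n, secs) = timed { (1 to 1000000).count(_ % 2 == 0) }
println(s"count=$n, measured in $secs s")

// In spark-shell, the same helper would wrap the two jobs being compared:
// val (lines, t1) = timed { sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count }
// val (rows, t2)  = timed { hql("select * from m").count }
```

Since `count` is an action, the whole job runs inside the timed block, so the measured time covers the full read of the data.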
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Spark users and developers,
>
> I'm doing some simple benchmarks with my team and we found a potential
> performance issue using Hive via SparkSQL. It is very bothersome, so your
> help in understanding why it is terribly slow is very important.
>
> First, we have some text files in HDFS which are also managed by Hive as a
> table called "m". There is nothing special about the table name "m".
>
> In the pure Spark way, I just do the following to get the total number of
> lines in the text files:
>
> scala> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
>
> This takes 2.7 minutes.
>
> If I use SparkSQL, I do this:
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import hiveContext._
> hql("use test")
> hql("select count(*) from m").collect.foreach(println)
>
> This takes 11.9 minutes!
>
> This is more than 4x slower than using pure Spark.
>
> I wonder if anyone knows what causes the performance issue?
>
> For the curious mind, the dataset is about 200-300GB and we are using 10
> machines for this benchmark. Given that the environment is the same for the
> two experiments, why is pure Spark faster than SparkSQL?
>
> Best Regards,
>
> Jerry