Hey Jerry,

When you ran these queries using the different methods, did you see any discrepancy in the returned results (i.e., the counts)?
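Just so we're comparing the same thing, here is roughly the check I have in mind. This is a minimal sketch against the Spark 1.0-era API (where HiveContext exposes hql(...) returning a SchemaRDD); it assumes an existing SparkContext named sc and your table m from the thread:

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext `sc` and the Hive table `m`
    // mentioned in the thread (Spark 1.0-era API).
    val hiveContext = new HiveContext(sc)

    // Count via a full scan followed by an RDD-side count...
    val rddCount = hiveContext.hql("SELECT * FROM m").count()

    // ...and via an explicit COUNT(*) inside the query. Per Michael's
    // note these should resolve to the same plan, so the two numbers
    // should agree; a mismatch would point at the partitioned-table
    // weirdness tracked in SPARK-2443.
    val sqlCount = hiveContext.hql("SELECT COUNT(*) FROM m").first().getLong(0)

    println(s"RDD count: $rddCount, COUNT(*): $sqlCount")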
On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Yeah, sorry. I think you are seeing some weirdness with partitioned tables
> that I have also seen elsewhere. I've created a JIRA and assigned someone
> at Databricks to investigate.
>
> https://issues.apache.org/jira/browse/SPARK-2443
>
> On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> Yes, the table is partitioned on 1 column. There are 11 columns in the
>> table, and they are all of String type.
>>
>> I understand that the SerDes contribute some overhead, but using pure
>> Hive we could run the query about 5 times faster than Spark SQL. Given
>> that Hive has the same SerDes overhead, there must be some additional
>> overhead in Spark SQL that Hive doesn't have.
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>> For the curious mind, the dataset is about 200-300 GB and we are using
>>>> 10 machines for this benchmark. Given that the environment is the same
>>>> for both experiments, why is pure Spark faster than Spark SQL?
>>>
>>> There is going to be some overhead to parsing data using the Hive SerDes
>>> instead of the native Spark code; however, the slowdown you are seeing
>>> here is much larger than I would expect. Can you tell me more about the
>>> table? What does the schema look like? Is it partitioned?
>>>
>>>> By the way, I also tried hql("select * from m").count. It is terribly
>>>> slow too.
>>>
>>> FYI, this query is actually identical to the one where you write out
>>> COUNT(*).