Hi Michael, Yes the table is partitioned on 1 column. There are 11 columns in the table and they are all String type.
I understand that SerDes contributes to some overheads but using pure Hive, we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes overhead, then there must be something additional that SparkSQL adds to the overall overheads that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <mich...@databricks.com> wrote: > On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> For the curious mind, the dataset is about 200-300GB and we are using 10 >> machines for this benchmark. Given the env is equal between the two >> experiments, why pure spark is faster than SparkSQL? >> > > There is going to be some overhead to parsing data using the Hive SerDes > instead of the native Spark code, however, the slow down you are seeing > here is much larger than I would expect. Can you tell me more about the > table? What does the schema look like? Is it partitioned? > > By the way, I also try hql("select * from m").count. It is terribly slow >> too. > > > FYI, this query is actually identical to the one where you write out > COUNT(*). >