On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
> For the curious mind, the dataset is about 200-300GB and we are using 10
> machines for this benchmark. Given the env is equal between the two
> experiments, why is pure Spark faster than SparkSQL?
>

There is going to be some overhead from parsing data using the Hive SerDes
instead of the native Spark code. However, the slowdown you are seeing
here is much larger than I would expect. Can you tell me more about the
table? What does the schema look like? Is it partitioned?

> By the way, I also tried hql("select * from m").count. It is terribly slow
> too.


FYI, this query is actually identical to the one where you write out
COUNT(*).
