Re: Using HQL is terribly slow: Potential Performance Issue

Jerry Lam Thu, 10 Jul 2014 17:34:34 -0700

Hi Michael,

Yes the table is partitioned on 1 column. There are 11 columns in the table
and they are all String type.

I understand that SerDes contributes to some overheads but using pure Hive,
we could run the query about 5 times faster than SparkSQL. Given that Hive
also has the same SerDes overhead, then there must be something additional
that SparkSQL adds to the overall overheads that Hive doesn't have.

Best Regards,

Jerry

On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <[email protected]>
wrote:

> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <[email protected]> wrote:
>
>> For the curious mind, the dataset is about 200-300GB and we are using 10
>> machines for this benchmark. Given the env is equal between the two
>> experiments, why pure spark is faster than SparkSQL?
>>
>
> There is going to be some overhead to parsing data using the Hive SerDes
> instead of the native Spark code, however, the slow down you are seeing
> here is much larger than I would expect. Can you tell me more about the
> table?  What does the schema look like?  Is it partitioned?
>
> By the way, I also try hql("select * from m").count. It is terribly slow
>> too.
>
>
> FYI, this query is actually identical to the one where you write out
> COUNT(*).
>

Re: Using HQL is terribly slow: Potential Performance Issue

Reply via email to